> From: Biswanath Patel
> Sent: Thu, May 13, 2010 3:30:06 AM
>
> I was recently trying to use generateDS with lxml. generateDS by
> default parses the XML using a SAX parser and creates a minidom Node
> object. It then traverses the node starting from root, to build
> the actual required object according to a schema. Now, the first
> part (parsing the XML string) can be easily converted to lxml,
> which returns an lxml etree Node object. However, I encountered
> some problems traversing this object with the generateDS code.
>
> What I find is that, though the algorithm used is generic and can
> be used to traverse any kind of node, the code itself is deeply
> tied to the minidom node. For example, functions like
> "getChildren()", attributes like "nodeValue" and "nodeType" and
> node types like "ELEMENT_NODE" or "TEXT_NODE" have been used, which
> are specific to minidom but are not found in other node elements -
> like in the node returned by lxml parsing.
>
> The core functionality of the generateDS module should be separated
> from the type of node being operated on - so that the module
> becomes node-agnostic - and the same generateDS functions can be
> integrated with any parsing module - lxml, SAX, or anything else.
> This is especially important since lxml provides significant
> improvements in parsing performance (I noticed speed-ups of almost
> 100 times) compared to minidom, especially for large XML files of
> over 30-40 MB.
>

Biswanath -

I agree with most of your points. But let me make sure I understand
what you are saying. We are talking about the parser used in the
code generated by generateDS, and *not* about the parser used in
generateDS.py itself, right? (The only use of minidom in
generateDS.py itself is to parse the session file, which is very
small.)

I agree, especially now that ElementTree is included in the Python
standard library, that the generated code should use ElementTree or
lxml (lxml if it is available, or perhaps as an option).
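
To make the "lxml if it is available" idea concrete, here is a minimal
sketch, assuming Python 3. It is not code from generateDS; the names
USING_LXML, walk and SAMPLE are hypothetical and exist only for this
illustration. It tries to import lxml, falls back to the standard
library ElementTree, and then traverses the tree using only calls that
both libraries provide (.tag, .text and iteration over children).

```python
# Prefer lxml when it is installed; otherwise use the standard library.
# USING_LXML, walk and SAMPLE are hypothetical names for this sketch.
try:
    from lxml import etree
    USING_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree
    USING_LXML = False


def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and all its descendants."""
    if not isinstance(element.tag, str):
        # lxml exposes comments and processing instructions as children
        # whose tag is not a string; skip them.
        return
    text = (element.text or "").strip()
    yield depth, element.tag, text
    for child in element:
        yield from walk(child, depth + 1)


SAMPLE = "<people><person id='1'><name>Alice</name></person></people>"

root = etree.fromstring(SAMPLE)
rows = list(walk(root))
for depth, tag, text in rows:
    print("  " * depth + tag, text)
```

Because the traversal never touches minidom-only names such as
nodeType or ELEMENT_NODE, the same function runs unchanged whichever
of the two imports succeeded.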

OK, if we agree so far, then I could modify generateDS.py so that it
generates code that uses ElementTree or lxml as its XML parser.

However, I don't understand what you mean by "node-agnostic". The
generated code has to use some DOM API. Since the APIs of ElementTree
and lxml are, for our purposes, the same, generateDS.py could generate
code that uses either of them. Is that what you mean?

I do wonder whether we should expect a very large speed-up. You are
right that lxml is faster than minidom. But both have C code
underneath (expat in the case of minidom, libxml2 in the case of
lxml). I ran a test on a 22 MB document, and lxml seems to be faster
than minidom by a factor of 1 to 15. However, if you look at the
generated code, you will see that the minidom parse is only the first
phase of building the tree of instances of generated classes. The
second phase is all Python code. I'm guessing that is where much of
the time goes.

So, give me some time to work on it. This task will not be at the
*very* top of my list, but I will work on it. I'll probably have some
questions for you once I've looked into it a bit more.

- Dave

-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman

------------------------------------------------------------------------------
_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users