> From: Biswanath Patel
> Sent: Thu, May 13, 2010 3:30:06 AM
>
> I was recently trying to use generateDS with lxml. generateDS by
> default parses the xml using SAX parser and creates a minidom Node
> object.  It then traverses the node starting from root, to build
> the actual required object according to a schema.  Now, the first
> part (parsing the xml string) can be easily converted to lxml,
> which returns an lxml etree Node object.  However, I encountered
> some problems traversing this object with the generateDS code.
> 
> What I find is that, though the algorithm used is generic and can
> be used to traverse any kind of node, the code itself is deeply
> tied to the minidom node.  For example, functions like
> "getChildren()", attributes like "nodeValue" and "nodeType" and
> node types like "ELEMENT_NODE" or "TEXT_NODE" have been used, which
> are specific to minidom but are not found in other node elements -
> like in the node returned by lxml parsing.
> 
> The core functionality of the generateDS module should be separated
> from the type of node being operated on - so that the module
> becomes node-agnostic - and the same generateDS functions can be
> integrated with any parsing module - lxml, SAX, or anything else. 
> This is especially important since lxml provides significant
> improvements in parsing performance (I noticed speed-ups of almost
> 100 times) compared to minidom, especially for large xmls of over
> 30-40 MBs.
> 

Biswanath -

I agree with most of your points.  But, let me make sure I understand
what you are saying.

We are talking about the parser used in the code generated by
generateDS, and *not* about the parser used in generateDS.py
itself, right?  (The only use of minidom in generateDS.py itself is
to parse the session file, which is very small.)

I agree, especially now that ElementTree is included in the Python
standard library, that the generated code should use ElementTree or
lxml (if lxml is available, or perhaps as an option).

OK, if we agree so far, then I could modify generateDS.py so that
it generates code that uses ElementTree or lxml as its XML parser.

However, I don't understand what you mean by "node-agnostic".  The
generated code has to use some DOM API.  Since the APIs of
ElementTree and lxml are, for our purposes, the same, generateDS.py
could generate code that uses either of them.  Is that what you
mean?
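
To make that concrete, here is a rough sketch of what I have in
mind.  It is not what generateDS.py produces today.  The class
Person, its build() method, the helper parse_person() and the
SAMPLE document are all made up for illustration.  The only real
API used is the part that lxml and ElementTree share.

```python
try:
    from lxml import etree  # used when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback


class Person(object):
    """Hypothetical stand-in for a generated class."""

    def __init__(self):
        self.id = None
        self.name = None
        self.interests = []

    def build(self, node):
        # Both libraries expose attributes through a dict-like .attrib.
        self.id = node.attrib.get("id")
        # Iterating over an element yields its child elements.
        for child in node:
            # lxml can also yield comments here; their .tag is not a string.
            if not isinstance(child.tag, str):
                continue
            if child.tag == "name":
                self.name = child.text
            elif child.tag == "interest":
                self.interests.append(child.text)
        return self


# Made-up sample document, used only for this illustration.
SAMPLE = (
    b'<person id="1"><name>Alice</name>'
    b"<interest>xml</interest><interest>python</interest></person>"
)


def parse_person(xml_bytes):
    # fromstring() accepts bytes in both lxml and ElementTree.
    root = etree.fromstring(xml_bytes)
    return Person().build(root)
```

The build() method does not care which of the two libraries
produced the node, which may be close to what you are asking for.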

I do wonder whether we should expect a very large speed up.  You
are right that lxml is faster than minidom.  But, both have C code
underneath (expat, in the case of minidom, libxml2 in the case of
lxml).  I ran a test on a 22 MB document, and lxml seems to be
faster than minidom by a factor of 1 to 15.  However, if you look
at the generated code, you will see that the minidom parse is only
the first phase of building the tree of instances of generated
classes.  The second phase is all Python code.  I'm guessing that
is where much of the time goes.
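
One way to check that guess is to time the two phases separately.
The following is only a sketch, and it uses the standard-library
ElementTree rather than lxml so that it runs anywhere.  The names
make_document(), timed(), walk_minidom(), walk_etree() and
compare() are made up for this example, and the walk functions are
a simple stand-in for the second phase, not the real generated
code.

```python
import time
import xml.etree.ElementTree as ET
from xml.dom import minidom


def make_document(n_items):
    # Build a synthetic test document with n_items child elements.
    items = "".join(
        "<item id='%d'>value %d</item>" % (i, i) for i in range(n_items)
    )
    return ("<root>%s</root>" % items).encode("utf-8")


def timed(func, *args):
    # Return the function result and the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def walk_minidom(doc):
    # Pure-Python pass over a minidom tree (stand-in for phase two).
    out = []
    for child in doc.documentElement.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            text = "".join(
                n.nodeValue
                for n in child.childNodes
                if n.nodeType == n.TEXT_NODE
            )
            out.append((child.getAttribute("id"), text))
    return out


def walk_etree(root):
    # The same pass over an ElementTree tree.
    return [(child.get("id"), child.text) for child in root]


def compare(n_items=20000):
    # Time parse and walk separately for each library.
    data = make_document(n_items)
    dom, dom_parse = timed(minidom.parseString, data)
    _, dom_walk = timed(walk_minidom, dom)
    root, et_parse = timed(ET.fromstring, data)
    _, et_walk = timed(walk_etree, root)
    return {
        "minidom_parse": dom_parse,
        "minidom_walk": dom_walk,
        "etree_parse": et_parse,
        "etree_walk": et_walk,
    }
```

Calling compare() returns the four timings in seconds.  The
numbers will depend on the machine and the document, so I would
not read too much into a single run.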

So, give me some time to work on it.  This task will not be at the
*very* top of my list, but I will work on it.

I'll probably have some questions for you once I have looked into it a
bit more.

- Dave

 -- 


Dave Kuhlman
http://www.rexx.com/~dkuhlman

------------------------------------------------------------------------------

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
