Hi Dave,

Thanks for the prompt reply. Yes, you are right - I meant the parser being
used in the code generated by generateDS. And yes, it has to use some DOM
API. So, there could be an additional choice to use lxml as the API apart
from the default minidom. The good thing is that you have written the
generateDS module in such a way that the code emitted by it is pretty
generic - so not much change is required in the logic itself, as I see it.
The bad thing is that the structural difference between the two kinds of
nodes is large.

For example, the lxml Node API is more light-weight and is built around just
4 core attributes - .tag, .text, .tail, .attrib (
http://infohost.nmt.edu/tcc/help/pubs/pylxml/etree-view.html). The structure
is simple and intuitive - children are the elements of a list and
"attributes" is a dictionary. On the other hand minidom has various types of
nodes ("ELEMENT_NODE", "TEXT_NODE") and each has some specific functions -
so there are checks required.
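
To make the difference concrete, here is a minimal sketch of the same traversal written against both APIs. It uses xml.etree.ElementTree from the standard library as a stand-in for lxml (the calls used here also exist in lxml.etree); the sample document and the helper names children_minidom / children_etree are made up for illustration and are not taken from the generateDS code.

```python
from xml.dom import minidom
import xml.etree.ElementTree as etree  # stand-in; lxml.etree offers the same calls used here

# Hypothetical sample document, only for illustration.
XML = '<person id="7"><name>Ann</name><city>Pune</city></person>'


def children_minidom(node):
    """Collect (tag, text) pairs; minidom needs a nodeType check per child."""
    result = []
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip text/comment nodes that sit between elements
        text = ''.join(
            c.nodeValue for c in child.childNodes
            if c.nodeType == c.TEXT_NODE)
        result.append((child.tagName, text))
    return result


def children_etree(element):
    """Same result; iterating an element yields its child elements directly."""
    return [(child.tag, child.text or '') for child in element]


dom_root = minidom.parseString(XML).documentElement
et_root = etree.fromstring(XML)

print(children_minidom(dom_root))
print(children_etree(et_root))
# Attributes: method call on minidom, plain dictionary lookup on etree.
print(dom_root.getAttribute('id'), et_root.attrib['id'])
```

The ElementTree-style version needs no node-type checks because text lives on the element itself (.text, .tail) rather than in separate child nodes.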

As for the resource usage, I found that the approximate split in parsing
(Step 1) and traversing (Step 2) is about 80:20. Speed-up for Step 1 would
be around 20-30 times even conservatively. Step 2 should also speed up by
30-40% since the Node object created is lighter, even though the same Python
code is used (and as I mentioned in the previous paragraph, the lxml node
structure is more generic). Plus memory usage by lxml is lower. So, seems
like a great thing.
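
A rough way to measure such a split is sketched below. It is only an assumption-laden stand-in: the document is synthetic, xml.etree.ElementTree is used in place of lxml, and the "walk" functions merely count child elements instead of running the real generated build code, so the printed numbers will not match the figures above. The names make_xml, timed, walk_minidom and walk_etree are hypothetical.

```python
import time
from xml.dom import minidom
import xml.etree.ElementTree as etree  # stand-in for lxml.etree


def make_xml(n):
    """Build a synthetic document with n <item> children."""
    items = ''.join(
        '<item id="%d"><name>n%d</name></item>' % (i, i) for i in range(n))
    return '<root>%s</root>' % items


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def walk_minidom(root):
    """Stand-in for Step 2: visit child elements, with a nodeType check."""
    count = 0
    for child in root.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            count += 1
    return count


def walk_etree(root):
    """Stand-in for Step 2 on an ElementTree-style element."""
    return sum(1 for _ in root)


xml_text = make_xml(20000)

dom, dom_parse = timed(minidom.parseString, xml_text)   # Step 1, minidom
n_dom, dom_walk = timed(walk_minidom, dom.documentElement)  # Step 2, minidom

et_root, et_parse = timed(etree.fromstring, xml_text)   # Step 1, etree
n_et, et_walk = timed(walk_etree, et_root)               # Step 2, etree

print('minidom: parse %.4fs, walk %.4fs' % (dom_parse, dom_walk))
print('etree:   parse %.4fs, walk %.4fs' % (et_parse, et_walk))
```

Swapping the import for lxml.etree and the walk functions for the generated build code would give numbers that can be compared with the estimates above.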

I will update you with my findings - please get back to me with any other
questions if you start working on this. :)

Regards,
Biswanath.

On 14 May 2010 08:19, Dave Kuhlman <dkuhl...@rexx.com> wrote:

> >
> >From: Biswanath Patel
> >Sent: Thu, May 13, 2010 3:30:06 AM
> >
> > I was recently trying to use generateDS with lxml. generateDS by
> > default parses the xml using SAX parser and creates a minidom Node
> > object.  It then traverses the node starting from root, to build
> > the actual required object according to a schema.  Now, the first
> > part (parsing the xml string) can be easily converted to lxml,
> > which returns an lxml etree Node object.  However, I encountered
> > some problems traversing this object with the generateDS code.
> >
> > What I find is that, though the algorithm used is generic and can
> > be used to traverse any kind of node, the code itself is deeply
> > tied to the minidom node.  For example, functions like
> > "getChildren()", attributes like "nodeValue" and "nodeType" and
> > node types like "ELEMENT_NODE" or "TEXT_NODE" have been used, which
> > are specific to minidom but are not found in other node elements -
> > like in the node returned by lxml parsing.
> >
> > The core functionality of the generateDS module should be separated
> > from the type of node being operated on - so that the module
> > becomes node-agnostic - and the same generateDS functions can be
> > integrated with any parsing module - lxml, SAX, or anything else.
> > This is especially important since lxml provides significant
> > improvements in parsing performance (I noticed speed-ups of almost
> > 100 times) compared to minidom, especially for large xmls of over
> > 30-40 MBs.
> >
>
> Biswanath -
>
> I agree with most of your points.  But, let me make sure I understand
> what you are saying.
>
> We are talking about the parser used in the code generated by
> generateDS, and *not* about the parser used in generateDS.py
> itself, right?  (The only use of minidom in generateDS.py itself is
> to parse the session file, which is very small.)
>
> I agree, especially now that ElementTree is included in the Python
> standard library, that the generated code should use ElementTree or
> lxml (if lxml is available, or perhaps as an option).
>
> OK, if we agree so far, then I could modify generateDS.py so that
> it generates code that uses ElementTree or lxml as its XML parser.
>
> However, I don't understand what you mean by "node-agnostic".  The
> generated code has to use some DOM API.  Since the API of
> ElementTree and lxml are, for our purposes, the same, generateDS.py
> could generate code that uses either of them.  Is that what you
> mean?
>
> I do wonder whether we should expect a very large speed up.  You
> are right that lxml is faster than minidom.  But, both have C code
> underneath (expat, in the case of minidom, libxml2 in the case of
> lxml).  I ran a test on a 22 MB document, and lxml seems to be
> faster than minidom by a factor of 1 to 15.  However, if you look
> at the generated code, you will see that the minidom parse is only
> the first phase of building the tree of instances of generated
> classes.  The second phase is all Python code.  I'm guessing that
> is where much of the time goes.
>
> So, give me some time to work on it.  This task will not be at the
> *very* top of my list, but I will work on it.
>
> I'll probably have some questions for you once I have looked into it a
> bit more.
>
> - Dave
>
>  --
>
>
> Dave Kuhlman
> http://www.rexx.com/~dkuhlman <http://www.rexx.com/%7Edkuhlman>
>
------------------------------------------------------------------------------

_______________________________________________
generateds-users mailing list
generateds-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/generateds-users
