[Alan Kennedy] >>SAX is perfect for the job. See code below. [Fredrik Lundh] > depends on your definition of perfect...
Obviously, perfect is the eye of the beholder ;-) [Fredrik Lundh] > using a 20 MB version of jog's sample, and having replaced > the print statements with local variable assignments, I get the > following timings: > > 5 lines of cElementTree code: 7.2 seconds > 60+ lines of xml.sax code: 63 seconds > > (Python 2.4.1, Windows XP, Pentium 3 GHz) Impressive! At first, I thought your code sample was building a tree for the entire document, so I checked the API docs. It appeared to me that an event processing model *couldn't* obtain the text for the node when notified of the node: the text node is still in the future. That's when I understood the nature of iterparse, which must generate an event *after* the node is complete, and it's subdocument reified. That's also when I understood the meaning of the "elem.clear()" call at the end. Only the required section of the tree is modelled in memory at any given time. Nice. There are some minor inefficiencies in my pure python sax code, e.g. building the stack expression for every evaluation, but I left them in for didactic reasons. But even if every possible speed optimisation was added to my python code, I doubt it would be able to match your code. I'm guessing that a lot of the reason why cElementTree performs best is because the model-building is primarily implemented in C: Both of our solutions run python code for every node in the tree, i.e. are O(N). But yours also avoids the overhead of having function-calls/stack-frames for every single node event, by processing all events inside a single function. If the SAX algorithm were implemented in C (or Java) for that matter, I wonder if it might give comparable performance to the cElementTree code, primarily because the data structures it is building are simpler, compared to the tree-subsections being reified and discarded by cElementTree. But that's not of relevance, because we're looking for python solutions. (Aside: I can't wait to run my solution on a fully-optimising PyPy :-) That's another nice thing I didn't know (c)ElementTree could do. enlightened-ly'yrs, -- alan kennedy ------------------------------------------------------ email alan: http://xhaus.com/contact/alan -- http://mail.python.org/mailman/listinfo/python-list