Hello everyone,

I've been racking my brain over a performance issue and I
could use some help.

At my work we have to parse extremely large XML files - 20GB and even
larger. The basic algorithm is as follows:

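from lxml import etree

with open(file, "rb") as reader:
    # Stream the document instead of loading it all into memory
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        ...  # (processing)
        el.clear()  # drop the element's children, text and attributes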

In Python 2.7, the processing time for a 20GB XML file is approximately 40
minutes.

In Python 3.13, it takes 7 hours, more than ten times as long as in Python 2.

We went through the code with a fine-toothed comb to find the reason (the
port involved only minimal changes). Out of desperation I commented
out the el.clear() line, and apparently that is the cause: without it,
performance in Python 3 matches Python 2.

Unfortunately, when we tested this on a server with less memory, the
program crashed because it ran out of memory (it worked fine with Python
2).

I tried replacing el.clear() with del el, but it did not help:
apparently there were still references somewhere, so the garbage collector
never freed the elements.
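
A small illustration of why del el alone frees nothing: del only unbinds the loop variable, while the parent element still holds every child. This is a minimal sketch, assuming lxml is installed; the sample XML and the helper name count_children_after_del are made up for the example.

```python
import io
from lxml import etree

# Tiny stand-in for the real 20GB file (hypothetical sample data).
XML = b"<root><item>1</item><item>2</item><item>3</item></root>"

def count_children_after_del(data):
    """Iterate with `del el` only and report how many children survive."""
    root = None
    for event, el in etree.iterparse(io.BytesIO(data), events=("start", "end")):
        if root is None:
            root = el  # the first 'start' event is the root element
        del el         # unbinds the name only; the tree still references the node
    return len(root)

remaining = count_children_after_del(XML)
print(remaining)
```

All children are still attached to the root after the loop, so memory keeps growing with the size of the document.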

Questions:

1. What is the difference between the Python 2 and Python 3
implementations of clear()?

2. Is there a way to avoid this performance penalty? I tried
fast_iter, clearing the root element, and re-assigning the element to None,
but nothing works.
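
For context, fast_iter usually refers to a pattern along these lines: listen for 'end' events only, clear each element after handling it, and delete the already-processed siblings from the parent. This is a sketch of that commonly circulated pattern, assuming lxml, and not necessarily the exact code tried here; the handle callback, the "item" tag and the sample XML are hypothetical.

```python
import io
from lxml import etree

def fast_iter(context, handle):
    """Process each element, then release it and its earlier siblings."""
    for event, elem in context:
        handle(elem)
        elem.clear()
        # Remove siblings that were already processed, so the parent
        # does not keep a growing list of empty elements alive.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

# Hypothetical sample data standing in for the real file.
XML = b"<root><item>1</item><item>2</item><item>3</item></root>"

seen = []
context = etree.iterparse(io.BytesIO(XML), events=("end",), tag="item")
fast_iter(context, lambda el: seen.append(el.text))
print(seen)
```

Note that this variant subscribes to 'end' events only, so each element is cleared once, after it has been fully parsed, whereas the loop above also calls clear() on every 'start' event.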

Any help would be greatly appreciated.

Regards,
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com