I reported the following problem some months ago, but didn’t get (or missed) an 
answer. Here it is again. I’m not sure whether it is in fact an lxml problem, 
but it only occurs in one particular lxml script. That script ran without 
problems for about a year, but suddenly stopped working.  It will now run 
properly through any individual file, but when I run it on a sequence of files 
it fails after a dozen or so files with a “memory allocation failed” message. 
If you restart from the file on which it failed, it processes that file 
properly, but fails again with the same error message after a few more files.

I run Python 3.7 in a conda environment in PyCharm.  The failure is produced by 
a function that sorts attributes alphabetically and indents a TEI XML file in 
which every token is wrapped in a <w> element that contains between three and 
eight attributes.  The files get edited a lot. We keep the attributes sorted to 
make it easier to recognize substantive changes or additions.

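def sort_and_indent(elem, level: int = 0):
    # Re-insert the attributes in alphabetical order of their names.
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)

    # Whitespace that puts a following tag on a new line at this level.
    i = "\n" + " " * level

    if len(elem):
        # Element with children: indent the first child one step further.
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            sort_and_indent(child, level + 1)
        # The last child's tail positions the parent's closing tag.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        # Leaf element: only set the tail; the root (level 0) is left alone.
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i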
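
To make the attribute-sorting step concrete, here is a minimal sketch of it applied to a single token. It is an illustration under assumptions, not the script itself: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree so that it runs without lxml installed, the helper name sort_attributes is hypothetical, and the attribute names and values on the sample <w> element are invented.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def sort_attributes(elem):
    """Reorder an element's attributes alphabetically by name, in place."""
    attributes = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attributes)


# Hypothetical token; attribute names and values are invented for illustration.
w = ET.fromstring('<w reg="the" pos="d" lemma="the">the</w>')
sort_attributes(w)
print(ET.tostring(w, encoding="unicode"))
```

The printed element has its attributes in the order lemma, pos, reg, which is the property that keeps diffs between edited versions of a file small.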

When the function fails it produces this error message:

/Users/martinmueller/.conda/envs/earlyprintprocessing/bin/python /Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py
Traceback (most recent call last):
  File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 71, in <module>
    do_etree(filename, item, counter)
  File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 49, in do_etree
    tree = etree.parse(filename, parser)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "/users/martinmueller/dropbox/eebochron/1470-1600/159/a/159-adp-A12229.xml", line 10137
lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24


The error message appears to be generated by lxml, but it may not be an lxml 
problem. I checked memory usage in the Activity Monitor of my Mac, which has 
64 GB of memory. Memory usage by Python goes beyond 2 GB, but the point of 
failure doesn’t seem to be related to the reported memory usage: in one batch 
of files the script keeps running at well over 2 GB, but in another run it 
fails at well below 2 GB.

I cannot associate the onset of this problem with any particular event.  I 
thought it could have something to do with a PyCharm update, but I just ran the 
script outside of PyCharm with Python 3.9 and lxml 4.6.2 and got the same 
error. In running the script twice from the first file in a batch, I noticed 
that it failed at exactly the same point in the same file. But it cannot be a 
function of that file alone, because if you run the program starting with that 
file it will process it properly.

It does appear, however, that something cumulative is going on: in moving from 
one file to the next, the script does not start from scratch, but keeps or 
fails to clear some memory that causes a failure when a trigger point is 
reached.
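
One way to test the hypothesis of cumulative state would be to process every file in its own Python process, so that nothing can carry over from one file to the next. The following is only a sketch under assumptions: process_one.py is a hypothetical script (it does not exist in the code described above) that takes a single file path on the command line and parses, sorts, indents, and writes that one file. The function names run_isolated and run_batch are likewise made up for illustration.

```python
import subprocess
import sys


def run_isolated(script, xml_file):
    """Run `script` on one file in a fresh interpreter; return True on success."""
    result = subprocess.run(
        [sys.executable, script, xml_file],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show the child's traceback so the failing file can be inspected.
        print(f"FAILED {xml_file}:\n{result.stderr}", file=sys.stderr)
    return result.returncode == 0


def run_batch(script, xml_files):
    """Process each file in its own process; return the files that failed."""
    return [f for f in xml_files if not run_isolated(script, f)]
```

Calling run_batch("process_one.py", list_of_paths) returns the paths whose run exited with a non-zero status. If the whole batch succeeds this way while the single-process loop still fails, that would support the idea that state accumulates inside one interpreter rather than in any particular file. Starting one interpreter per file is slower, so this is meant as a diagnostic rather than a permanent fix.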



I’d be very grateful for any help or advice on where to look for the cause.



Martin Mueller
Professor emeritus of English and Classics
Northwestern University








_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/