Martin Mueller schrieb am 17.03.21 um 18:35:
> I reported the following problem some months ago, but didn’t get (or missed) 
> an answer. Here it is again. I’m not sure whether it is in fact an lxml 
> problem, but it only occurs in one particular lxml script. That script ran 
> without problems for about a year, but suddenly stopped working.  It will now 
> run properly through any individual file, but when run on a sequence of 
> files it will fail after a dozen or so files with a “memory allocation 
> failed” message. If you start from the file on which it failed it will 
> process that file properly, but fail after processing some files with the 
> same error message.
> 
> I run Python 3.7 in a conda environment in Pycharm.  The failure is produced 
> by a function that sorts attributes alphabetically and indents a TEI XML file 
> in which every token is wrapped in a <w> element that contains between three 
> and eight attributes.  The files get edited a lot. We keep the attributes 
> sorted to make it easier to recognize substantive changes or additions.
> 
> def sort_and_indent(elem, level: int = 0):
>     attrib = elem.attrib
>     if len(attrib) > 1:
>         attributes = sorted(attrib.items())
>         attrib.clear()
>         attrib.update(attributes)
> 
>     i = "\n" + " " * level
> 
>     if len(elem):
>         if not elem.text or not elem.text.strip():
>             elem.text = i + " "
>         if not elem.tail or not elem.tail.strip():
>             elem.tail = i

>         for elem in elem:
>             sort_and_indent(elem, level + 1)
>         if not elem.tail or not elem.tail.strip():
>             elem.tail = i

The last part reads a bit dangerously, since it overwrites the "elem" variable
in the loop. It probably works ok – it just requires a second look to
understand what it does. And it's risky if you ever end up adding more
functionality at the end of the function that still needs the original
"elem" value.


>     else:
>         if level and (not elem.tail or not elem.tail.strip()):
>             elem.tail = i
> 
> When the function fails it produces this error message:
> 
> /Users/martinmueller/.conda/envs/earlyprintprocessing/bin/python /Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py
> Traceback (most recent call last):
>   File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 71, in <module>
>     do_etree(filename, item, counter)
>   File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 49, in do_etree
>     tree = etree.parse(filename, parser)
>   File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
>   File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
>   File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
>   File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
>   File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
>   File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
>   File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
>   File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
>   File "/users/martinmueller/dropbox/eebochron/1470-1600/159/a/159-adp-A12229.xml", line 10137
> lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24

The first thing I notice here is that the failure is not in the function
that you showed us, but already at the point where it parses the file.


> The error message appears to be generated by lxml, but it may not be an lxml 
> problem. I checked memory usage in the Activity Monitor of my Mac, which 
> has 64GB of memory. Memory usage by Python goes beyond 2GB, but the point 
> of failure doesn’t seem to be related to the memory usage that is reported: 
> it keeps running at well over 2GB in one batch of files, but in another run 
> it fails at well below 2GB.

Not sure how Macs are set up here, but try to make sure that you are using
a 64bit Python installation and not a 32bit one.

A 64bit system would output this:

    python3 -c 'import sys; print(sys.maxsize)'
    9223372036854775807

or the same for Python 2.x:

    python2 -c 'import sys; print(sys.maxint)'
    9223372036854775807


> I cannot associate the onset of this problem with any particular event.  I 
> thought it could have something to do with a Pycharm update, but I just ran 
> the script outside of Pycharm with Python 3.9 and lxml 4.6.2. I got the same 
> error. In running the script twice from the first file in a batch, I noticed 
> that it failed at exactly the same point in the same file. But it cannot be 
> a function of that file because if you run the program starting with that 
> file it will process it properly.

Is there anything special about the file that it's trying to parse here?
How big are these files (uncompressed)?


> It does appear, however, that something cumulative is going on: in moving 
> from one file to the next, the script does not start from scratch, but keeps 
> or fails to clear some memory that causes a failure when a trigger point is 
> reached.

What happens to the data after parsing and processing one file? Does it get
cleaned up before parsing the next one? You might need to "del" some
variables before starting the next loop iteration, to make sure that the
XML tree really gets released *before* parsing the next one, and not just
by overwriting the variable *after* parsing it. That's a common issue with
automatic memory management that can easily (and needlessly) lead to twice
the memory usage for a program.
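
For example, a rough sketch (I'm guessing at how your loop is structured;
"filenames", the parser options and the write step are placeholders, and
sort_and_indent() is your function from above):

    from lxml import etree

    parser = etree.XMLParser(remove_blank_text=True)

    for filename in filenames:
        tree = etree.parse(filename, parser)
        sort_and_indent(tree.getroot())
        tree.write(filename, encoding="utf-8", xml_declaration=True)
        # drop the reference explicitly, so that the whole tree can be
        # freed *before* the next file is parsed, not only afterwards
        del tree

Your traceback suggests that the per-file work already happens inside a
do_etree() function, so the same idea would apply to any module-level
variables that still keep a whole tree (or elements of it) alive across
calls.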

Stefan