Thank you, Stefan, for this advice. I shared it with a friend, who is much more
expert than I. He suggested a variant of your advice: adding "tree = None"
after the tree.write command whenever files are processed in a loop. This
appears to work.
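
A minimal sketch of that pattern, not the actual script: "rewrite_all" is a
hypothetical name, the processing step is elided, and the fallback import is
only there so the example runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Standard-library fallback so the sketch still runs without lxml.
    import xml.etree.ElementTree as etree


def rewrite_all(filenames):
    """Parse, rewrite, and release each file in turn; return the file count."""
    count = 0
    for filename in filenames:
        tree = etree.parse(filename)
        # ... modify tree.getroot() here (sort attributes, indent, etc.) ...
        tree.write(filename, encoding="utf-8", xml_declaration=True)
        # Drop the reference so the old tree can be freed before the next
        # call to etree.parse() builds a new one.
        tree = None
        count += 1
    return count
```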

On 3/26/21, 12:54 PM, "Stefan Behnel" <stefan...@behnel.de> wrote:

    Martin Mueller schrieb am 17.03.21 um 18:35:
    > I reported the following problem some months ago, but didn’t get (or
    > missed) an answer. Here it is again. I’m not sure whether it is in fact an
    > lxml problem, but it only occurs in one particular lxml script. That script
    > ran without problems for about a year, but suddenly stopped working. It
    > will now run properly through any individual file, but when run on a
    > sequence of files it will fail after a dozen or so files with a “memory
    > allocation failed” message. If you start from the file on which it failed
    > it will process that file properly, but then fail after processing some
    > more files with the same error message.
    > 
    > I run Python 3.7 in a conda environment in Pycharm. The failure is
    > produced by a function that sorts attributes alphabetically and indents a
    > TEI XML file in which every token is wrapped in a <w> element that contains
    > between three and eight attributes. The files get edited a lot. We keep
    > the attributes sorted to make it easier to recognize substantive changes or
    > additions.
    > 
    > def sort_and_indent(elem, level: int = 0):
    >     attrib = elem.attrib
    >     if len(attrib) > 1:
    >         attributes = sorted(attrib.items())
    >         attrib.clear()
    >         attrib.update(attributes)
    > 
    >     i = "\n" + " " * level
    > 
    >     if len(elem):
    >         if not elem.text or not elem.text.strip():
    >             elem.text = i + " "
    >         if not elem.tail or not elem.tail.strip():
    >             elem.tail = i
    
    >         for elem in elem:
    >             sort_and_indent(elem, level + 1)
    >         if not elem.tail or not elem.tail.strip():
    >             elem.tail = i
    
    The last part reads a bit dangerous since it overwrites the "elem" variable
    in the loop. It probably works ok – it just requires at least a second
    look to understand what it does. And it's risky if you ever end up adding
    more functionality at the end of the function that still needs the original
    "elem" value.
    
    
    >     else:
    >         if level and (not elem.tail or not elem.tail.strip()):
    >             elem.tail = i
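
    Picking up the remark about the reused loop variable: a minimal sketch of
    the same function with a separate name, "child", for the loop. Only the
    loop variable is changed; the behaviour is meant to stay identical, and
    the final check still adjusts the tail of the last child.

```python
def sort_and_indent(elem, level: int = 0):
    # Re-insert the attributes in alphabetical order.
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)

    i = "\n" + " " * level

    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        # "elem" keeps its original meaning; "child" walks the children.
        for child in elem:
            sort_and_indent(child, level + 1)
        # After the loop, "child" is the last child: pull its tail back
        # to the parent's indentation level.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
```

    With this naming, code added at the end of the function can still use
    "elem" for the element that was passed in.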
    > 
    > When the function fails it produces this error message:
    > 
    > /Users/martinmueller/.conda/envs/earlyprintprocessing/bin/python /Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py
    > Traceback (most recent call last):
    >   File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 71, in <module>
    >     do_etree(filename, item, counter)
    >   File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 49, in do_etree
    >     tree = etree.parse(filename, parser)
    >   File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
    >   File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
    >   File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
    >   File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
    >   File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
    >   File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
    >   File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
    >   File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
    >   File "/users/martinmueller/dropbox/eebochron/1470-1600/159/a/159-adp-A12229.xml", line 10137
    > lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24
    
    The first thing I notice here is that the failure is not in the function
    that you showed us, but already at the point where it parses the file.
    
    
    > The error message appears to be generated by lxml, but it may not be an
    > lxml problem. I checked memory usage on the Activity Monitor of my Mac,
    > which has 64GB of memory. Memory usage by Python goes beyond 2GB, but the
    > point of failure doesn’t seem to be related to the memory usage that is
    > reported: it keeps running at well over 2GB in one batch of files, but in
    > another run it fails at well below 2GB.
    
    Not sure how Macs are set up here, but try to make sure that you are using
    a 64bit Python installation and not a 32bit one.
    
    A 64bit system would output this:
    
        python3 -c 'import sys; print(sys.maxsize)'
        9223372036854775807
    
    or the same for Python 2.x:
    
        python2 -c 'import sys; print(sys.maxint)'
        9223372036854775807
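
    The same check can also be done from inside a script, which helps when
    it is not obvious which interpreter an IDE actually launches. A small
    sketch, with "pointer_bits" as a hypothetical helper name:

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(sys.executable)
    print(pointer_bits(), "bit Python, sys.maxsize =", sys.maxsize)
```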
    
    
    > I cannot associate the onset of this problem with any particular event.
    > I thought it could have something to do with a Pycharm update, but I just
    > ran the script outside of Pycharm with Python 3.9 and lxml 4.6.2. I got the
    > same error. In running the script twice from the first file in a batch, I
    > noticed that it failed at exactly the same point in the same file. But it
    > cannot be a function of that file because if you run the program starting
    > with that file it will process it properly.
    
    Is there anything special about the file that it's trying to parse here?
    How big are these files (uncompressed)?
    
    
    > It does appear, however, that something cumulative is going on: in moving
    > from one file to the next, the script does not start from scratch, but
    > keeps or fails to clear some memory that causes a failure when a trigger
    > point is reached.
    
    What happens to the data after parsing and processing one file? Does it get
    cleaned up before parsing the next one? You might need to "del" some
    variables before starting the next loop iteration, to make sure that the
    XML tree really gets released *before* parsing the next one, and not just
    by overwriting the variable *after* parsing it. That's a common issue with
    automatic memory management that can easily (and needlessly) lead to twice
    the memory usage for a program.
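
    To illustrate the difference, here is a small sketch that uses a
    stand-in class instead of real XML trees. "FakeTree" and both function
    names are hypothetical, and the counts rely on CPython's reference
    counting, which frees an object as soon as its last reference is gone.

```python
import weakref


class FakeTree:
    """Stand-in for a parsed tree; records how many instances are alive."""
    live = weakref.WeakSet()
    peak = 0

    def __init__(self, name):
        self.name = name
        FakeTree.live.add(self)
        # Measured while the new object is being built, i.e. "during parsing".
        FakeTree.peak = max(FakeTree.peak, len(FakeTree.live))


def peak_when_overwriting(names):
    FakeTree.peak = 0
    tree = None
    for name in names:
        # The previous tree is still referenced while the new one is built.
        tree = FakeTree(name)
    return FakeTree.peak


def peak_when_releasing(names):
    FakeTree.peak = 0
    for name in names:
        tree = FakeTree(name)
        # ... process and write the tree here ...
        del tree  # released before the next iteration starts building
    return FakeTree.peak
```

    On CPython, the first function reports a peak of two live trees for any
    input of two or more names, and the second a peak of one.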
    
    Stefan
    _______________________________________________
    lxml - The Python XML Toolkit mailing list -- lxml@python.org
    To unsubscribe send an email to lxml-le...@python.org
    
    https://mail.python.org/mailman3/lists/lxml.python.org/
    Member address: martinmuel...@northwestern.edu
    
