I reported the following problem some months ago but didn’t get (or missed) an answer, so here it is again. I’m not sure whether it is in fact an lxml problem, but it occurs only in one particular lxml script. That script ran without problems for about a year and then suddenly stopped working. It still runs properly on any individual file, but when I run it on a sequence of files it fails after a dozen or so files with a “Memory allocation failed” message. If I restart from the file on which it failed, it processes that file properly but fails again, with the same error message, after processing some more files.
I run Python 3.7 in a conda environment in PyCharm. The failure is produced by a function that sorts attributes alphabetically and indents a TEI XML file in which every token is wrapped in a <w> element that contains between three and eight attributes. The files get edited a lot. We keep the attributes sorted to make it easier to recognize substantive changes or additions.

    def sort_and_indent(elem, level: int = 0):
        attrib = elem.attrib
        if len(attrib) > 1:
            attributes = sorted(attrib.items())
            attrib.clear()
            attrib.update(attributes)
        i = "\n" + " " * level
        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + " "
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
            for elem in elem:
                sort_and_indent(elem, level + 1)
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i

When the function fails it produces this error message:

    /Users/martinmueller/.conda/envs/earlyprintprocessing/bin/python /Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py
    Traceback (most recent call last):
      File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 71, in <module>
        do_etree(filename, item, counter)
      File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 49, in do_etree
        tree = etree.parse(filename, parser)
      File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
      File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
      File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
      File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
      File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
      File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
      File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
      File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
      File "/users/martinmueller/dropbox/eebochron/1470-1600/159/a/159-adp-A12229.xml", line 10137
    lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24

The error message appears to be generated by lxml, but it may not be an lxml problem. I checked memory usage in Activity Monitor on my Mac, which has 64 GB of memory. Memory usage by Python goes beyond 2 GB, but the point of failure doesn’t seem to be related to the reported memory usage: in one batch of files the script keeps running at well over 2 GB, but in another run it fails at well below 2 GB.

I cannot associate the onset of this problem with any particular event. I thought it could have something to do with a PyCharm update, but I just ran the script outside of PyCharm with Python 3.9 and lxml 4.6.2 and got the same error.

In running the script twice from the first file in a batch, I noticed that it failed at exactly the same point in the same file. But it cannot be a function of that file, because if you run the program starting with that file it will process it properly. It does appear, however, that something cumulative is going on: in moving from one file to the next, the script does not start from scratch, but keeps or fails to clear some memory, which causes a failure when a trigger point is reached.

I’d be very grateful for any help or advice on where to look.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
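One way to test the suspicion that something accumulates from file to file is to process every file in a brand-new Python process and see whether the failure disappears. The following is only a minimal sketch under stated assumptions, not a fix: `run_each_in_fresh_process` is a made-up helper name, and it assumes the per-file work has been moved into a script that takes a single filename as its command-line argument. It uses only the standard library and makes no lxml calls itself.

```python
import subprocess
import sys


def run_each_in_fresh_process(command, filenames):
    """Run `command + [filename]` once per file, each in a new process.

    `command` is an assumption: a list such as
    [sys.executable, "rewrite_one.py"], where "rewrite_one.py" is a
    hypothetical script that processes the one file named in sys.argv[1].

    Returns a list of (filename, return code, last stderr line) tuples.
    """
    results = []
    for filename in filenames:
        # Each call starts a separate interpreter, so no parser state or
        # memory can carry over from the previous file.
        completed = subprocess.run(
            list(command) + [filename],
            capture_output=True,
            text=True,
        )
        stderr_lines = completed.stderr.strip().splitlines()
        last_line = stderr_lines[-1] if stderr_lines else ""
        results.append((filename, completed.returncode, last_line))
    return results
```

If a whole batch completes this way with return code 0 for every file, that would point toward state building up inside the long-running process rather than toward any one file. If the same file still fails in isolation, the cumulative explanation becomes less likely.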