On 14 Feb 2025, at 11:12, Stefan Behnel via lxml - The Python XML
Toolkit wrote:
> Then you're not cleaning up enough of the XML tree. Some of it remains
> in memory after processing it, and thus leads to swapping and long
> waiting times.
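For reference, the usual cleanup pattern with lxml's iterparse looks roughly like this (a minimal sketch; the function name, tag name and input file are placeholders, not code from the original discussion):

```python
from lxml import etree

def iter_and_free(source, tag):
    """Yield matching elements and release them once they've been handled."""
    for event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem
        # Drop the element's own children and text now that it has been used
        elem.clear()
        # Also delete already-processed siblings still referenced by the root,
        # otherwise the tree keeps growing as the parse progresses
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Usage: for record in iter_and_free("big.xml", "record"): process(record)
```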
It's definitely a memory issue. You can write some code to track memory
use quickly. This is something we wrote for openpyxl while we were trying
to "contain" memory use:
```python
import os

import openpyxl
from memory_profiler import memory_usage


def test_memory_use():
    """Naive test that assumes memory use will never be more than 120 %
    of that for the first 50 rows."""
    folder = os.path.split(__file__)[0]
    src = os.path.join(folder, "files", "very_large.xlsx")
    # Read-only mode streams rows instead of loading the whole workbook
    wb = openpyxl.load_workbook(src, read_only=True)
    ws = wb.active
    initial_use = None
    for n, line in enumerate(ws.iter_rows(values_only=True)):
        # Sample memory every 50 rows and compare against the first sample
        if n % 50 == 0:
            use = memory_usage(proc=-1, interval=1)[0]
            if initial_use is None:
                initial_use = use
            assert use / initial_use < 1.2
            print(n, use)


if __name__ == '__main__':
    test_memory_use()
```
You should be able to adapt this for your parser, and it will tell you soon
enough how far in you get before your memory use balloons. If memory
serves, I once had a problem where I was clearing elements in the wrong
place, which meant that other elements were sticking around; thanks to
Stefan for helping me sort that out. I think your code may be too
aggressive. It might help to look at the openpyxl worksheet parser, which
has to handle what happens if you do additional processing within nodes.
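To adapt the memory check above to an lxml parse, one possibility (just a sketch with placeholder names, not code from openpyxl or lxml) is to sample memory inside the iterparse loop and clear elements only after your per-element processing is done:

```python
from lxml import etree
from memory_profiler import memory_usage

def check_parse_memory(src, tag, limit=1.2):
    """Parse src with iterparse and assert memory stays within `limit`
    of the level measured at the start of the loop."""
    initial_use = None
    for n, (event, elem) in enumerate(etree.iterparse(src, events=("end",), tag=tag)):
        # ... do your per-element processing here, *before* clearing ...

        # Clear only once processing is finished; clearing earlier (or on
        # "start" events) throws away children you may still need
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

        if n % 50 == 0:
            use = memory_usage(proc=-1, interval=1)[0]
            if initial_use is None:
                initial_use = use
            assert use / initial_use < limit
            print(n, use)
```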
Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D-40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226