Hi Adam,

Some more news on this:
I ran valgrind, and it is indeed as you suggest: the memory is not freed to
the OS. One of my colleagues found this: http://xmlsoft.org/xmlmem.html

> You may encounter that your process using libxml2 does not have a reduced
> memory usage although you freed the trees. This is because libxml2
> allocates memory in a number of small chunks. When freeing one of those
> chunks, the OS may decide that giving this little memory back to the
> kernel will cause too much overhead and delay the operation. As all
> chunks are this small, they get actually freed but not returned to the
> kernel. On systems using glibc, there is a function call "malloc_trim"
> from malloc.h which does this missing operation (note that it is allowed
> to fail). *Thus, after freeing your tree you may simply try
> "malloc_trim(0);"* to really get the memory back. If your OS does not
> provide malloc_trim, try searching for a similar function.

I added this code:

    import ctypes

    def trim_memory() -> int:
        libc = ctypes.CDLL("libc.so.6")
        return libc.malloc_trim(0)

This seems to fix it! Perhaps it would be good if lxml did this by default?

Wouter

On Wed, 2 Jun 2021 at 21:23, Adam Tauno Williams <awill...@whitemice.org>
wrote:

> On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
> > I'm chasing an elusive memory leak and it might be related to lxml.
> > I hope you can help me to understand it better.
> > When I parse a large XML file, and let it get garbage collected,
> > memory is not freed up:
>
> I suspect the issue here is "freed". What you demonstrate does not
> indicate the memory is not being freed; if it is freed, it is available
> to the process for re-use. You are measuring the allocation OF THE
> PROCESS [allocated], not of the memory being USED.
>
> malloc + free does not typically return memory to the OS, it only makes
> it free in the address space OF THE PROCESS.
>
> This is not a bug, it is how modern operating systems [UNIX/LINUX at
> least] operate within the generous address spaces of 32 & 64 bit
> processors.
>
> If monitoring the working size of a process was what I wanted I would
> choose rss of data in the process object. But no available value is
> really memory-used.
>
> > E.g. when I run the following code:
> >
> > import logging
> > import psutil
> > import os
> > import humanize
> > import gc
> >
> > LOGGER = logging.getLogger(__name__)
> >
> > def get_memory_usage(process: psutil.Process) -> int:
> >     with process.oneshot():
> >         return process.memory_full_info().data
> >
> > def log_mem_diff(process: psutil.Process, message: str) -> int:
> >     usage = get_memory_usage(process)
> >     LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
> >     return usage
> >
> > process = psutil.Process(os.getpid())
> >
> > import xml.etree as etree
> > import xml.etree.ElementTree
> >
> > def build_tree(xml):
> >     tree = etree.ElementTree.fromstring(xml)
> >     log_mem_diff(process, "In_scope")
> >     # tree goes out of scope here
> >
> > # import lxml.etree as etree
> > #
> > # def build_tree(xml):
> > #     parser = etree.XMLParser(remove_blank_text=True,
> > #                              collect_ids=False)
> > #     tree = etree.XML(xml, parser)
> > #     log_mem_diff(process, "In_scope")
> >
> > with open("junos-conf-root.xml", "r") as f:
> >     xml = f.read()
> >
> > for i in range(0, 5):
> >     build_tree(xml)
> >     log_mem_diff(process, "before gc")
> >     gc.collect()
> >     log_mem_diff(process, "after gc")
> >
> > I get
> >
> > In_scope: 1.4 GB
> > before gc: 1.4 GB
> > after gc: 1.4 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> >
> > This is not a leak per-se, but it behaves unexpectedly in that
> > 1. memory usage goes up
> > 2. running the GC doesn't reduce it
> > 3. running the code again, it doesn't keep going up.
> >
> > I'm trying to understand this behavior.
> > Could you be of assistance in this?
> >
> > Python           : sys.version_info(major=3, minor=8, micro=9,
> > releaselevel='final', serial=0)
> > lxml.etree       : (4, 6, 3, 0)
> > libxml used      : (2, 9, 10)
> > libxml compiled  : (2, 9, 10)
> > libxslt used     : (1, 1, 34)
> > libxslt compiled : (1, 1, 34)
> >
> > Wouter
> >
> > _______________________________________________
> > lxml - The Python XML Toolkit mailing list -- lxml@python.org
> > To unsubscribe send an email to lxml-le...@python.org
> > https://mail.python.org/mailman3/lists/lxml.python.org/
> > Member address: awill...@whitemice.org
>
> --
> Adam Tauno Williams <mailto:awill...@whitemice.org> GPG D95ED383
> OpenGroupware Developer <http://www.opengroupware.us/>

--
Wouter De Borger
Chief Architect
Inmanta
+32479474994
wouter.debor...@inmanta.com
www.inmanta.com
Kapeldreef 60, 3001 Heverlee
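P.S. Here is the workaround as a self-contained sketch. The find_library
fallback and the throwaway allocation are my additions for illustration;
malloc_trim is glibc-specific (it is missing on e.g. musl, where the
attribute lookup would raise) and is documented as allowed to fail, so
treat its return value as best-effort.

```python
import ctypes
import ctypes.util

def trim_memory() -> int:
    # Resolve the C library; fall back to the usual glibc soname.
    libc_name = ctypes.util.find_library("c") or "libc.so.6"
    libc = ctypes.CDLL(libc_name)
    # malloc_trim(0) asks glibc to return freed heap pages to the
    # kernel; it returns 1 if memory was released, 0 otherwise.
    return libc.malloc_trim(0)

# Allocate and drop a pile of small objects, mimicking the small freed
# libxml2 chunks, then ask the allocator to give the pages back.
garbage = [bytes(1024) for _ in range(100_000)]
del garbage
released = trim_memory()
print("malloc_trim:", released)
```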