On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
> I'm chasing an elusive memory leak and it might be related to lxml.
> I hope you can help me to understand it better.
> When I parse a large XML file, and let it get garbage collected,
> memory is not freed up:

I suspect the issue here is "freed". What you demonstrate does not
indicate that the memory is not being freed; if it is freed, it is
available to the process for re-use. You are measuring the allocation
OF THE PROCESS [allocated], not the memory being USED. malloc + free
does not typically return memory to the OS; it only makes it free in
the address space OF THE PROCESS. This is not a bug, it is how modern
operating systems [UNIX/LINUX at least] operate within the generous
address spaces of 32 & 64 bit processors.

If monitoring the working size of a process was what I wanted, I would
choose rss of data in the process object. But no available value is
really memory-used.

> E.g. when I run following code:
>
> import logging
> import psutil
> import os
> import humanize
> import gc
>
> LOGGER = logging.getLogger(__name__)
>
> def get_memory_usage(process: psutil.Process) -> int:
>     with process.oneshot():
>         return process.memory_full_info().data
>
>
> def log_mem_diff(process: psutil.Process, message: str) -> int:
>     usage = get_memory_usage(process)
>     LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
>     return usage
>
> process = psutil.Process(os.getpid())
>
> import xml.etree as etree
> import xml.etree.ElementTree
> def build_tree(xml):
>     tree = etree.ElementTree.fromstring(xml)
>     log_mem_diff(process, "In_scope")
>     # tree goes out of scope here
>
> # import lxml.etree as etree
>
> # def build_tree(xml):
> #     parser = etree.XMLParser(remove_blank_text=True, collect_ids=False)
> #     tree = etree.XML(xml, parser)
> #     log_mem_diff(process, "In_scope")
>
> with open("junos-conf-root.xml", "r") as f:
>     xml = f.read()
>
> for i in range(0, 5):
>     build_tree(xml)
>     log_mem_diff(process, "before gc")
>
>     gc.collect()
>     log_mem_diff(process, "after gc")
>
>
> I get
>
> In_scope: 1.4 GB
> before gc: 1.4 GB
> after gc: 1.4 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
>
> This is not a leak per-se, but it behaves unexpectedly in that
> 1. memory usage goes up
> 2. running the GC doesn't reduce it
> 3. running the code again, it doesn't keep going up.
>
> I'm trying to understand this behavior.
> Could you be of assistance in this?
>
> Python              : sys.version_info(major=3, minor=8, micro=9,
>                       releaselevel='final', serial=0)
> lxml.etree          : (4, 6, 3, 0)
> libxml used         : (2, 9, 10)
> libxml compiled     : (2, 9, 10)
> libxslt used        : (1, 1, 34)
> libxslt compiled    : (1, 1, 34)
>
> Wouter
>
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: awill...@whitemice.org

-- 
Adam Tauno Williams <mailto:awill...@whitemice.org> GPG D95ED383
OpenGroupware Developer <http://www.opengroupware.us/>
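
To illustrate the difference between memory in use by live objects and the allocation of the process, here is a minimal sketch. It is not taken from the thread. The helper names `rss_bytes` and `measure` are made up for this example, and the RSS reading assumes a Linux-style /proc (it yields None elsewhere). Note that tracemalloc only sees allocations made through Python's own allocator, so as far as I know it would not account for what libxml2 allocates in C on behalf of lxml. The point is only to show the two views side by side.

```python
import gc
import os
import tracemalloc


def rss_bytes():
    """Resident set size of this process, read from /proc (Linux only).

    Returns None where /proc/self/statm is not available.
    """
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure(n_items=200_000):
    """Allocate many small objects, drop them, and report both views."""
    tracemalloc.start()

    data = [str(i) * 10 for i in range(n_items)]
    python_live, _ = tracemalloc.get_traced_memory()
    rss_live = rss_bytes()

    # Drop the only reference; the objects are freed back to the allocator.
    del data
    gc.collect()
    python_freed, _ = tracemalloc.get_traced_memory()
    rss_freed = rss_bytes()

    tracemalloc.stop()
    return {
        "python_live": python_live,    # bytes held by live Python objects
        "python_freed": python_freed,  # same counter after the objects are gone
        "rss_live": rss_live,          # process-level view, or None
        "rss_freed": rss_freed,        # may or may not shrink
    }


if __name__ == "__main__":
    for name, value in measure().items():
        print(f"{name}: {value}")
```

The tracemalloc figure should drop sharply once the list is deleted. Whether the process-level figure follows depends on the allocator and the platform, which is the distinction made above: freed memory stays available to the process for re-use even when the OS-level number does not go down.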