On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
> I'm chasing an elusive memory leak and it might be related to lxml.
> I hope you can help me to understand it better.
> When I parse a large XML file, and let it get garbage collected,
> memory is not freed up:
I suspect the issue here is the word "freed". What you demonstrate does
not indicate that the memory is not being freed; once it is freed, it is
available to the process for re-use. You are measuring the memory
ALLOCATED TO THE PROCESS, not the memory being USED.
malloc + free does not typically return memory to the OS; it only makes
it free in the address space OF THE PROCESS.
This is not a bug, it is how modern operating systems [UNIX/LINUX at
least] operate within the generous address spaces of 32 & 64 bit
processors.
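
A minimal sketch of how one could watch this happen, assuming Linux (it
reads the resident set size from /proc/self/statm). The helper names
rss_bytes, allocate_and_drop and rss_readings are made up for this
illustration; they are not part of psutil or lxml. On a typical run the
reading rises after the first round and then stays roughly level instead
of falling back, but the exact numbers depend on the allocator and the
platform.

```python
import gc
import os


def rss_bytes() -> int:
    """Current resident set size in bytes (Linux only: parses /proc/self/statm)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field is resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def allocate_and_drop(n: int) -> None:
    """Build a large list of strings, then release every reference to it."""
    data = [str(i) * 10 for i in range(n)]
    del data
    gc.collect()


def rss_readings(rounds: int = 3, n: int = 200_000) -> list:
    """Return the RSS before the first round and after each round."""
    readings = [rss_bytes()]
    for _ in range(rounds):
        allocate_and_drop(n)
        readings.append(rss_bytes())
    return readings


if __name__ == "__main__":
    for index, value in enumerate(rss_readings()):
        print(f"reading {index}: {value / 1e6:.1f} MB")
```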
If monitoring the working size of a process were what I wanted, I would
choose rss or data in the process object. But no available value is
really memory-used.
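
As an aside, the standard library's tracemalloc module gets closer to
"memory used" for allocations made through Python's own allocator. The
sketch below is only an illustration (traced_usage is a made-up name),
and memory that a C library such as libxml2 obtains directly from malloc
would likely not show up in it, so it does not answer the lxml case by
itself. It does show the difference between what is in use right now
and what was allocated at the peak.

```python
import tracemalloc


def traced_usage(n: int = 100_000):
    """Return (bytes while data is in scope, bytes after del, peak bytes)."""
    tracemalloc.start()
    try:
        data = [str(i) for i in range(n)]
        in_scope, _ = tracemalloc.get_traced_memory()
        del data  # the list and its strings are released here
        after_free, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return in_scope, after_free, peak


if __name__ == "__main__":
    in_scope, after_free, peak = traced_usage()
    print(f"in scope:   {in_scope} bytes")
    print(f"after free: {after_free} bytes")
    print(f"peak:       {peak} bytes")
```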
> E.g. when I run following code:
>
> import logging
> import psutil
> import os
> import humanize
> import gc
>
> LOGGER = logging.getLogger(__name__)
>
> def get_memory_usage(process: psutil.Process) -> int:
>     with process.oneshot():
>         return process.memory_full_info().data
>
>
> def log_mem_diff(process: psutil.Process, message: str) -> int:
>     usage = get_memory_usage(process)
>     LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
>     return usage
>
> process = psutil.Process(os.getpid())
>
> import xml.etree as etree
> import xml.etree.ElementTree
> def build_tree(xml):
>     tree = etree.ElementTree.fromstring(xml)
>     log_mem_diff(process, "In_scope")
>     # tree goes out of scope here
>
> # import lxml.etree as etree
>
> # def build_tree(xml):
> #     parser = etree.XMLParser(remove_blank_text=True, collect_ids=False)
> #     tree = etree.XML(xml, parser)
> #     log_mem_diff(process, "In_scope")
>
> with open("junos-conf-root.xml", "r") as f:
>     xml = f.read()
>
> for i in range(0, 5):
>     build_tree(xml)
>     log_mem_diff(process, "before gc")
>
>     gc.collect()
>     log_mem_diff(process, "after gc")
>
>
>
>
> I get
>
> In_scope: 1.4 GB
> before gc: 1.4 GB
> after gc: 1.4 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
>
> This is not a leak per-se, but it behaves unexpectedly in that
> 1. memory usage goes up
> 2. running the GC doesn't reduce it
> 3. running the code again, it doesn't keep going up.
>
> I'm trying to understand this behavior.
> Could you be of assistance in this?
>
> Python : sys.version_info(major=3, minor=8, micro=9,
> releaselevel='final', serial=0)
> lxml.etree : (4, 6, 3, 0)
> libxml used : (2, 9, 10)
> libxml compiled : (2, 9, 10)
> libxslt used : (1, 1, 34)
> libxslt compiled : (1, 1, 34)
>
> Wouter
>
> _______________________________________________
> lxml - The Python XML Toolkit mailing list --
> [email protected]
>
> To unsubscribe send an email to
> [email protected]
>
> https://mail.python.org/mailman3/lists/lxml.python.org/
>
> Member address:
> [email protected]
>
>
--
Adam Tauno Williams <mailto:[email protected]> GPG D95ED383
OpenGroupware Developer <http://www.opengroupware.us/>