On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
> I'm chasing an elusive memory leak and it might be related to lxml.
> I hope you can help me to understand it better. 
> When I parse a large XML file, and let it get garbage collected,
> memory is not freed up:

I suspect the issue here is the word "freed".  What you demonstrate
does not indicate the memory is not being freed; once it is freed it is
available to the process for re-use.  You are measuring the allocation
OF THE PROCESS [what it has reserved from the OS], not the memory
actually being USED.

malloc + free does not typically return memory to the OS; it only makes
it free in the address space OF THE PROCESS.

This is not a bug; it is how modern operating systems [UNIX/Linux at
least] operate within the generous address spaces of 32- and 64-bit
processors.
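To make that concrete, here is a small stdlib-only sketch (assuming
Linux or macOS, where the `resource` module reports ru_maxrss): the
peak resident size goes up when you allocate, and freeing everything
does not bring the process-level number back down.

```python
import gc
import resource

def peak_rss() -> int:
    # ru_maxrss is the peak resident set size of this process:
    # kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
blob = [b"x" * 1024 for _ in range(100_000)]  # roughly 100 MB of small objects
during = peak_rss()

del blob
gc.collect()
after = peak_rss()

print(during > before)   # the allocations drove the peak up
print(after >= during)   # freeing them does not shrink the process-level figure
```

The same effect is what the psutil numbers in the quoted script show:
libxml2/Python hand the memory back to the allocator, but the
allocator keeps it in the process for re-use.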

If monitoring the working size of a process is what you want, I would
look at the rss, or the data segment, of the process.  But no single
available value is really "memory used".
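For reference, on Linux you can read both of those numbers straight
from the kernel via /proc, without psutil.  A minimal sketch (Linux
only; values are reported in kB):

```python
def proc_memory_kb() -> dict:
    # Parse VmRSS (resident set size) and VmData (data segment size)
    # from /proc/self/status.  Both values are in kB.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmData:")):
                key, value = line.split(":", 1)
                fields[key] = int(value.strip().split()[0])
    return fields

mem = proc_memory_kb()
print(mem)  # e.g. {'VmRSS': ..., 'VmData': ...}
```

VmData is what `psutil.Process.memory_full_info().data` reports in the
quoted script; VmRSS is the resident working set, which is usually the
more meaningful figure for "how big is this process right now".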

> E.g. when I run following code:
> 
> import logging
> import psutil
> import os
> import humanize
> import gc
> 
> LOGGER = logging.getLogger(__name__)
> 
> def get_memory_usage(process: psutil.Process) -> int:
>     with process.oneshot():
>         return process.memory_full_info().data
> 
> 
> def log_mem_diff(process: psutil.Process, message: str) -> int:
>     usage = get_memory_usage(process)
>     LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
>     return usage
> 
> process = psutil.Process(os.getpid())
> 
> import xml.etree as etree
> import xml.etree.ElementTree
> def build_tree(xml):
>     tree = etree.ElementTree.fromstring(xml)
>     log_mem_diff(process, "In_scope")
>     # tree goes out of scope here
> 
> # import lxml.etree as etree
> 
> # def build_tree(xml):
> #     parser = etree.XMLParser(remove_blank_text=True, collect_ids=False)
> #     tree = etree.XML(xml, parser)
> #     log_mem_diff(process, "In_scope")
> 
> with open("junos-conf-root.xml", "r") as f:
>     xml = f.read()
> 
> for i in range(0, 5):
>     build_tree(xml)
>     log_mem_diff(process, "before gc")
> 
>     gc.collect()
>     log_mem_diff(process, "after gc")
> 
> 
> 
> 
> I get
> 
> In_scope: 1.4 GB
> before gc: 1.4 GB
> after gc: 1.4 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> In_scope: 1.7 GB
> before gc: 1.7 GB
> after gc: 1.7 GB
> 
> This is not a leak per se, but it behaves unexpectedly in that
> 1. memory usage goes up
> 2. running the GC doesn't reduce it
> 3. running the code again, it doesn't keep going up.
> 
> I'm trying to understand this behavior.
> Could you be of assistance in this?
> 
> Python              : sys.version_info(major=3, minor=8, micro=9,
> releaselevel='final', serial=0)
> lxml.etree          : (4, 6, 3, 0)
> libxml used         : (2, 9, 10)
> libxml compiled     : (2, 9, 10)
> libxslt used        : (1, 1, 34)
> libxslt compiled    : (1, 1, 34)
> 
> Wouter
> 
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: awill...@whitemice.org
-- 
Adam Tauno Williams <mailto:awill...@whitemice.org> GPG D95ED383
OpenGroupware Developer <http://www.opengroupware.us/>
