Hi Adam,

Some more news on this:

I ran valgrind, and it is indeed as you suggest: the memory is freed, but
not returned to the OS.

One of my colleagues found this: http://xmlsoft.org/xmlmem.html

> You may encounter that your process using libxml2 does not have a reduced
> memory usage although you freed the trees. This is because libxml2
> allocates memory in a number of small chunks. When freeing one of those
> chunks, the OS may decide that giving this little memory back to the kernel
> will cause too much overhead and delay the operation. As all chunks are
> this small, they get actually freed but not returned to the kernel. On
> systems using glibc, there is a function call "malloc_trim" from malloc.h
> which does this missing operation (note that it is allowed to fail). *Thus,
> after freeing your tree you may simply try "malloc_trim(0);"* to really
> get the memory back. If your OS does not provide malloc_trim, try searching
> for a similar function.
>

I added this code:


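import ctypes

def trim_memory() -> int:
    # malloc_trim is a glibc extension, so this only works on Linux with glibc.
    libc = ctypes.CDLL("libc.so.6")
    # Returns 1 if memory was released back to the OS, 0 otherwise.
    return libc.malloc_trim(0)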


This seems to fix it!
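
For anyone who wants to reuse this, here is a slightly more defensive sketch
of the same idea. It is only an illustration, not something lxml provides:
the platform check, the fallback to "libc.so.6" when
ctypes.util.find_library finds nothing, and the choice to return a bool are
my own assumptions. On systems without glibc (for example musl) it simply
does nothing.

```python
import ctypes
import ctypes.util
import sys


def trim_memory() -> bool:
    """Ask glibc to hand free heap memory back to the OS.

    Returns True if malloc_trim reported that memory was released,
    False if nothing was released or malloc_trim is not available.
    """
    if not sys.platform.startswith("linux"):
        return False  # malloc_trim is a glibc extension
    try:
        # Assumption: fall back to the usual glibc soname if lookup fails.
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return False  # libc not loadable, or it has no malloc_trim
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # malloc_trim returns 1 if memory was released, 0 otherwise.
    return malloc_trim(0) == 1
```

I call it after the last reference to the tree is gone and gc.collect() has
run, since only memory that has already been freed can be trimmed.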

Perhaps it would be good if lxml did this by default?

Wouter

On Wed, 2 Jun 2021 at 21:23, Adam Tauno Williams <awill...@whitemice.org>
wrote:

> On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
> > I'm chasing an elusive memory leak and it might be related to lxml.
> > I hope you can help me to understand it better.
> > When I parse a large XML file, and let it get garbage collected,
> > memory is not freed up:
>
> I suspect the issue here is "freed".  What you demonstrate does not
> indicate the memory is not being freed; if it is freed it is available
> to the process for re-use.  You are measuring the allocation OF THE
> PROCESS [allocated] not of the memory being USED.
>
> malloc + free does not typically return memory to the OS, it only makes
> it free in the address space OF THE PROCESS.
>
> This is not a bug, it is how modern operating systems [UNIX/LINUX at
> least] operate within the generous address spaces of 32 & 64 bit
> processors.
>
> If monitoring the working size of a process was what I wanted I would
> choose rss of data in the process object.  But no available value is
> really memory-used.
>
> > E.g. when I run following code:
> >
> > import logging
> > import psutil
> > import os
> > import humanize
> > import gc
> >
> > LOGGER = logging.getLogger(__name__)
> >
> > def get_memory_usage(process: psutil.Process) -> int:
> >     with process.oneshot():
> >         return process.memory_full_info().data
> >
> >
> > def log_mem_diff(process: psutil.Process, message: str) -> int:
> >     usage = get_memory_usage(process)
> >     LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
> >     return usage
> >
> > process = psutil.Process(os.getpid())
> >
> > import xml.etree as etree
> > import xml.etree.ElementTree
> > def build_tree(xml):
> >     tree = etree.ElementTree.fromstring(xml)
> >     log_mem_diff(process, "In_scope")
> >     # tree goes out of scope here
> >
> > # import lxml.etree as etree
> >
> > # def build_tree(xml):
> > #     parser = etree.XMLParser(remove_blank_text=True, collect_ids=False)
> > #     tree = etree.XML(xml, parser)
> > #     log_mem_diff(process, "In_scope")
> >
> > with open("junos-conf-root.xml", "r") as f:
> >     xml = f.read()
> >
> > for i in range(0, 5):
> >     build_tree(xml)
> >     log_mem_diff(process, "before gc")
> >
> >     gc.collect()
> >     log_mem_diff(process, "after gc")
> >
> >
> >
> >
> > I get
> >
> > In_scope: 1.4 GB
> > before gc: 1.4 GB
> > after gc: 1.4 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> > In_scope: 1.7 GB
> > before gc: 1.7 GB
> > after gc: 1.7 GB
> >
> > This is not a leak per-se, but it behaves unexpectedly in that
> > 1. memory usage goes up
> > 2. running the GC doesn't reduce it
> > 3. running the code again, it doesn't keep going up.
> >
> > I'm trying to understand this behavior.
> > Could you be of assistance in this?
> >
> > Python              : sys.version_info(major=3, minor=8, micro=9, releaselevel='final', serial=0)
> > lxml.etree          : (4, 6, 3, 0)
> > libxml used         : (2, 9, 10)
> > libxml compiled     : (2, 9, 10)
> > libxslt used        : (1, 1, 34)
> > libxslt compiled    : (1, 1, 34)
> >
> > Wouter
> >
> > _______________________________________________
> > lxml - The Python XML Toolkit mailing list --
> > lxml@python.org
> >
> > To unsubscribe send an email to
> > lxml-le...@python.org
> >
> > https://mail.python.org/mailman3/lists/lxml.python.org/
> >
> > Member address:
> > awill...@whitemice.org
> >
> >
> --
> Adam Tauno Williams <mailto:awill...@whitemice.org> GPG D95ED383
> OpenGroupware Developer <http://www.opengroupware.us/>
>
>


-- 
Wouter De Borger

Chief Architect

Inmanta
+32479474994 <0479474994>
wouter.debor...@inmanta.com
www.inmanta.com
Kapeldreef 60, 3001 Heverlee
