On Sat, 2009-09-19 at 11:57 +0200, Francesc Alted wrote: 
> On Friday 18 September 2009 21:57:58, David Fokkema wrote:
> > On Fri, 2009-09-18 at 17:07 +0200, Francesc Alted wrote:
> > > On Friday 18 September 2009 16:09:58, David Fokkema wrote:
> > > > Hi list,
> > > >
> > > > I'm not sure what this is... I've written a minimal script which shows
> > > > the following problem: fill up a table with 10 million rows, which
> > > > costs almost no memory. Then, do the following query:
> > > >
> > > > r = data.root.events.col('event_id')
> > > >
> > > > which brings up memory usage from 14 Mb to 99 Mb. Do it again, which
> > > > brings memory usage further up by tens of Mb's, which are freed after
> > > > the query finishes.
> > >
> > > This is expected.  While the query is executing, the results are being
> > > kept in a new NumPy array.  When the query finishes, the new NumPy object
> > > is bound to the `r` variable, and the old NumPy object pointed to by `r`
> > > is released.
> >
> > Ah, yes, of course. Interestingly, it seems that sys.getsizeof doesn't
> > report the size of the NumPy object, but only the reference r? It
> > returns 40 bytes, nothing else.
> 
> I did not know about sys.getsizeof(), but it doesn't seem reliable for
> getting the size of NumPy arrays.  Maybe it is worth asking the NumPy (or
> even Python) list about this.

I discovered it somewhere on a list. Now I've read the documentation as
well ;-) So it is only accurate for built-in types, and for third-party
types like NumPy arrays it depends on their implementation. Clearly, that
implementation is either non-existent or flawed here.
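
Just to illustrate the difference, a minimal sketch (the getsizeof figure
will depend on platform and NumPy version; nbytes is the size of the actual
data buffer):

import sys
import numpy

# one million uint64 values: the data buffer alone is 8,000,000 bytes
arr = numpy.zeros(1000000, dtype=numpy.uint64)
print arr.nbytes          # 8000000
print sys.getsizeof(arr)  # only a few dozen bytes if ndarray does not
                          # implement __sizeof__ (i.e. just the object struct)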

> 
> > > > Instead, try the following query:
> > > >
> > > > r = [x['event_id'] for x in data.root.events]
> > > >
> > > > which brings memory usage from 14 Mb to 296 Mb. Do it again, which
> > > > brings memory usage up to 528 Mb.
> > >
> > > Expected again.  In this case, you are getting the column as a Python
> > > list, and this takes *far* more space than a regular NumPy array.
> >
> > Ok, but surely not _that_ much space? I end up with a list consisting of
> > 10 million values (longs) which came from a UInt64Col, so they should take
> > up about 8 bytes each, so let's say 80 million bytes if python doesn't
> > optimize the small numbers, plus some overhead for the list.
> > Now, sys.getsizeof returns 40 megabytes, which is about what I'd expect.
> > However, that's nowhere near the 282 Mb which is taken up by python.
> 
> Again, I don't know why sys.getsizeof returns 40 MB for a list of 10 million
> *long* integers.  Look at this:
> 
> In [40]: sys.getsizeof(1)
> Out[40]: 24

12

> 
> In [41]: sys.getsizeof(1L)
> Out[41]: 32

16

> 
> In [42]: sys.getsizeof(2**62)
> Out[42]: 24

24

> 
> In [43]: sys.getsizeof(2**63)
> Out[43]: 40

24 (yes, even 2**64)

I'm on 32-bit, you're probably on 64-bit? Wow, this is really platform
specific. More so than I'd expect from my C experience.
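
For reference, a quick sketch to check which flavour a given interpreter is,
if anyone wants to compare numbers:

import struct
import sys

print struct.calcsize('P') * 8  # native pointer size in bits: 32 or 64
print sys.maxint                # largest plain int before python switches to long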

> So, it is not clear to me how many bytes a long integer takes, but it clearly
> depends on the number of significant digits.  At any rate, I find 40 MB for
> 10 million longs to be too small a figure (?).

Well... See:

>>> a = range(2**70-10, 2**70+10)
>>> sys.getsizeof(a[0])
24
>>> sys.getsizeof(a[0:1])
36
>>> sys.getsizeof(a[0:2])
40
>>> sys.getsizeof(a[0:3])
44
>>> sys.getsizeof(a[0:4])
48

which shows that, if sys.getsizeof is correct (and according to the
official docs it should be), each extra value only adds about 4 bytes.
There is some overhead for the list and some overhead for a single long,
but that's about it: 4 bytes per value. That makes a list of 10 million
longs about 40 million bytes, I guess.
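
On closer reading of the docs, though, those 4 bytes per value are probably
just the pointer slots in the list itself: sys.getsizeof of a container does
not include the objects it refers to. A rough sketch of how the element
objects could be counted as well (assuming all elements are distinct):

import sys

# small sample list of distinct longs, just to illustrate
values = [long(100e6) + i for i in xrange(1000)]

# the list object itself: header plus one pointer per slot
list_size = sys.getsizeof(values)

# add the sizes of the long objects the list refers to
total_size = list_size + sum(sys.getsizeof(v) for v in values)

print list_size, total_size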

> > > > Del-ing objects and imports doesn't clean up memory...
> > >
> > > It should.  How are you deleting objects, and how do you determine that
> > > memory is not being released?
> >
> > Ah, lets see:
> >
> > This script:
> >
> > import tables
> >
> > class Event(tables.IsDescription):
> >     event_id = tables.UInt64Col()
> >     ext_timestamp = tables.UInt64Col(dflt=9999)
> >     other_value = tables.UInt64Col(dflt=9999)
> >
> > def create_tables():
> >     data = tables.openFile('test.h5', 'w', 'PyTables Test')
> >     data.createTable('/', 'events', Event, 'Test Events')
> >
> >     table = data.root.events
> >     tablerow = table.row
> >     for i in xrange(10000000):
> >         tablerow['event_id'] = i
> >         tablerow.append()
> >     table.flush()
> >
> >     data.close()
> >
> > def test_query():
> >     data = tables.openFile('test.h5', 'r')
> >     r = [x['event_id'] for x in data.root.events]
> >     data.close()
> >     return r
> >
> > And this is my log:
> > >>> from test_tables import *
> > >>> create_tables()
> >
> > (now test.h5 is 230 Mb in size and python uses 19 Mb)
> >
> > >>> r = test_query()
> >
> > (now python uses 293 Mb)
> >
> > >>> import sys
> > >>> sys.getsizeof(r)
> >
> > 40764028
> >
> > (which is only 40 Mb, right? That's something I can live with, ;-) )
> >
> > >>> dir()
> >
> > ['Event', '__builtins__', '__doc__', '__name__', '__package__',
> > 'create_tables', 'r', 'sys', 'tables', 'test_query']
> >
> > >>> del Event
> > >>> del create_tables
> > >>> del r
> > >>> del tables
> > >>> del test_query
> > >>> del sys
> > >>> dir()
> >
> > ['__builtins__', '__doc__', '__name__', '__package__']
> >
> > (python still uses 293 Mb...)
> >
> > So... is this strange? Test_query closes the file so there shouldn't be
> > anything floating around related to that... However, there might be
> > something in the C code which is malloc-ing but not freeing memory?
> 
> Mmh, you are not telling us how you determine the python interpreter's memory
> consumption.  If you are using tools like 'top' or 'ps', maybe you are seeing
> python's own malloc'ing subsystem at work (for a number of reasons, it is a
> bit lazy about releasing memory).

Just top, yes. I know that python < 2.5 was notorious for not returning
freed memory to the OS, but I have seen in simple tests with more recent
versions (not importing anything) that creating lists and del-ing them
immediately releases the memory.

> The correct check for a leak would be to put the list creation in a loop and
> see if memory consumption grows or stabilizes at the size of the list.  For
> example, memory usage in:
> 
> create_tables()
> for i in range(N):
>     r = test_query()
> 
> is always around 1 GB (on my 64-bit machine), no matter whether N is 5, 10
> or 20.  This is *clear* evidence that a memory leak is not developing here.
> 

Indeed, with N sufficiently large (> 2) it stabilizes, so at least there's
no malloc without free in the query loop. Furthermore, if I do t = r to
'copy' the result list and then redo the query, memory usage balloons
again. Since I have only 1 Gb of memory, I redid the tests with only
1 million rows: sys.getsizeof reported 4 Mb, keeping the old list around
and redoing the query added 24 Mb to python's memory usage, and del-ing
all copies only returned a fraction of that to the system.
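
By the way, instead of eyeballing top, the loop test can also print the
resident set size from within python. A minimal sketch (Linux-specific, and
assuming the test_query() from the script above):

import re

def rss_mb():
    # Linux-specific: read VmRSS from /proc/self/status (reported in kB)
    f = open('/proc/self/status')
    status = f.read()
    f.close()
    return int(re.search(r'VmRSS:\s+(\d+) kB', status).group(1)) / 1024.0

for i in range(5):
    r = test_query()
    print i, rss_mb()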

Another strange thing is that:

>>> a = []
>>> for i in xrange(long(10e6)):
...     a.append(long(100e6))
... 
>>> b = []
>>> for i in xrange(long(10e6)):
...     b.append(100000000)
... 
>>> a == b
True
>>> del a
>>> del b

Resident memory usage: after python start: 3M, after 'a': 276M, after 'b': 314M,
after 'del a': 41M, after 'del b': 3M. sys.getsizeof returns about 40M for both
'a' and 'b', consistent with the size of 'b'.
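
One possible explanation for the difference between 'a' and 'b' (just a
guess, sketched below assuming CPython 2.x): long(100e6) creates a fresh
long object on every iteration, while the literal 100000000 in the second
loop is a single constant in the compiled code, so 'b' ends up holding
10 million references to one and the same int. That would make 'b'
essentially just the ~40M pointer array, while 'a' drags along 10 million
separate long objects. A quick check:

# do the two lists refer to distinct objects, or to a single shared one?
a = [long(100e6) for i in xrange(5)]

b = []
for i in xrange(5):
    b.append(100000000)  # literal: presumably one constant object, appended 5 times

print len(set(map(id, a)))  # expect 5 distinct long objects
print len(set(map(id, b)))  # expect 1 if the literal really is shared

If that is what happens, the asymmetry in resident memory is about object
count, not about PyTables at all.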

I should probably move this to the python list...

> Hope that helps,

Not yet, but hopefully getting there...

Thanks!

David

