On Friday 18 September 2009 21:57:58, David Fokkema wrote:
> On Fri, 2009-09-18 at 17:07 +0200, Francesc Alted wrote:
> > On Friday 18 September 2009 16:09:58, David Fokkema wrote:
> > > Hi list,
> > >
> > > I'm not sure what this is... I've written a minimal script which shows
> > > the following problem: fill up a table with 10 million rows, which
> > > costs almost no memory. Then, do the following query:
> > >
> > > r = data.root.events.col('event_id')
> > >
> > > which brings memory usage up from 14 MB to 99 MB. Do it again, which
> > > brings memory usage up further by tens of MB, which are freed after
> > > the query finishes.
> >
> > This is expected.  While the query is executing, the results are being
> > kept in a new NumPy array.  When the query finishes, the new NumPy object
> > is bound to the `r` variable, and the old NumPy object pointed by `r` is
> > released.
>
> Ah, yes, of course. Interestingly, it seems that sys.getsizeof doesn't
> report the size of the NumPy object, but only the reference r? It
> returns 40 bytes, nothing else.

I did not know about sys.getsizeof(), but it doesn't seem reliable for getting 
the size of NumPy arrays.  Maybe it is worth asking the NumPy (or even Python) 
list about this.
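For what it's worth, here is a minimal sketch of how I would check it (it
assumes the test.h5 file your script creates, and a NumPy of this era, which
apparently does not implement __sizeof__ for arrays, so sys.getsizeof() only
sees the array object itself; the data buffer is reported by .nbytes):

import sys
import tables

# Open the file created by your script and read the column as a NumPy array.
data = tables.openFile('test.h5', 'r')
arr = data.root.events.col('event_id')   # a fresh in-memory copy of the column

print sys.getsizeof(arr)   # only the ndarray object itself (the ~40 bytes you saw)
print arr.nbytes           # the data buffer: 10**7 rows * 8 bytes (uint64) ~ 80 MB
print arr.dtype, arr.shape

data.close()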

> > > Instead, try the following query:
> > >
> > > r = [x['event_id'] for x in data.root.events]
> > >
> > > which brings memory usage from 14 MB to 296 MB. Do it again, which
> > > brings memory usage up to 528 MB.
> >
> > Expected again.  In this case, you are getting the column as a Python
> > list, and this takes *far* more space than a regular NumPy array.
>
> Ok, but surely not _that_ much space? I end up with a list consisting of
> 10 million values (longs) which came from a UInt64Col, so they should take
> up about 8 bytes each: let's say 80 million bytes, assuming python doesn't
> optimize the small numbers, plus some overhead for the list itself.
> Now, sys.getsizeof returns 40 megabytes, which is about what I'd expect.
> However, that's nowhere near the 282 MB taken up by python.

Again, I don't know why sys.getsizeof returns only 40 MB for a list of 10
million *long* integers.  Look at this:

In [40]: sys.getsizeof(1)
Out[40]: 24

In [41]: sys.getsizeof(1L)
Out[41]: 32

In [42]: sys.getsizeof(2**62)
Out[42]: 24

In [43]: sys.getsizeof(2**63)
Out[43]: 40

So, it is not clear to me how many bytes a long integer takes, but it clearly 
depends on the number of significant digits.  At any rate, I find 40 MB for 10 
million longs to be too small a figure (?).
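My guess is that sys.getsizeof() on a list only accounts for the list object
and its internal array of item pointers, not the long objects those pointers
reference, so the ~40 MB would be roughly one pointer per element and only a
fraction of the real footprint.  A quick sketch of that (Python 2 syntax,
using 1 million values as a smaller stand-in for your 10 million):

import sys
import numpy

values = [long(i) for i in xrange(1000000)]   # 1 million longs

list_only = sys.getsizeof(values)             # list object + item pointers only
items = sum(sys.getsizeof(v) for v in values) # the long objects themselves

print list_only          # a few MB: ~4 or ~8 bytes per pointer (32/64-bit)
print list_only + items  # several times larger once the longs are counted

# The same values as a contiguous NumPy uint64 array:
arr = numpy.arange(1000000, dtype=numpy.uint64)
print arr.nbytes         # 8000000 bytes, and that is all there is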

> > > Del-ing objects and imports doesn't clean up memory...
> >
> > It should.  How are you deleting objects, and how do you determine that
> > memory is not being released?
>
> Ah, let's see:
>
> This script:
>
> import tables
>
> class Event(tables.IsDescription):
>     event_id = tables.UInt64Col()
>     ext_timestamp = tables.UInt64Col(dflt=9999)
>     other_value = tables.UInt64Col(dflt=9999)
>
> def create_tables():
>     data = tables.openFile('test.h5', 'w', 'PyTables Test')
>     data.createTable('/', 'events', Event, 'Test Events')
>
>     table = data.root.events
>     tablerow = table.row
>     for i in xrange(10000000):
>         tablerow['event_id'] = i
>         tablerow.append()
>     table.flush()
>
>     data.close()
>
> def test_query():
>     data = tables.openFile('test.h5', 'r')
>     r = [x['event_id'] for x in data.root.events]
>     data.close()
>     return r
>
> And this is my log:
> >>> from test_tables import *
> >>> create_tables()
>
> (now test.h5 is 230 MB in size and python uses 19 MB)
>
> >>> r = test_query()
>
> (now python uses 293 MB)
>
> >>> import sys
> >>> sys.getsizeof(r)
>
> 40764028
>
> (which is only 40 MB, right? That's something I can live with, ;-) )
>
> >>> dir()
>
> ['Event', '__builtins__', '__doc__', '__name__', '__package__',
> 'create_tables', 'r', 'sys', 'tables', 'test_query']
>
> >>> del Event
> >>> del create_tables
> >>> del r
> >>> del tables
> >>> del test_query
> >>> del sys
> >>> dir()
>
> ['__builtins__', '__doc__', '__name__', '__package__']
>
> (python still uses 293 MB...)
>
> So... is this strange? test_query() closes the file, so there shouldn't be
> anything floating around related to that... However, might there be
> something in the C code which is malloc'ing but not freeing memory?

Mmh, you are not telling us how you determine the python interpreter's memory 
consumption.  If you are using tools like 'top' or 'ps', you may just be seeing 
python's own malloc'ing subsystem at work (for a number of reasons, it is a bit 
lazy about releasing memory back to the OS).

The correct check for a leak would be to put the list creation in a loop and 
see whether memory consumption keeps growing or stabilizes at the size of the 
list.  For example, memory usage in:

create_tables()
for i in range(N):
    r = test_query()

is always around 1 GB (on my 64-bit machine), no matter whether N is 5, 10 or 
20.  This is *clear* evidence that no memory leak is developing here.
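
If you want to watch this from inside the process rather than with 'top',
something like the sketch below might help.  It assumes a Linux-style
/proc/self/status (the VmRSS line); read_rss() is just a helper name I made up
here, and the loop reuses the create_tables() and test_query() from your
script:

def read_rss():
    """Resident set size of this process in kB (hypothetical helper, Linux only)."""
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])
    return -1

create_tables()
for i in range(5):
    r = test_query()
    print i, read_rss(), 'kB'

If the printed figure keeps climbing on every pass, that would point to a
leak; if it levels off after the first pass or two, you are just seeing
python's allocator holding on to blocks it has already freed internally.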

Hope that helps,

-- 
Francesc Alted
