Thanks for looking into this (no complaints here!).  Indeed, using the 
chunkshape parameter with PyTables 2.0 greatly improved the memory usage and 
overall speed.  

However...

Switching to PyTables 2.0 brings a few wrinkles.  Specifically, there seems to be 
something about numpy string handling that yields extra 'junk' at the end of 
the string when the string column is referenced.  This code:

from tables import *

fp = openFile( "foo", 'w' )

# A table with a single 20-character string column.
table = fp.createTable( fp.root, 'title',
                        { 'var1' : StringCol( itemsize=20) }, '')
table.append( [['abc']])
fp.flush()

b = table.read()
print b
print b['var1']


yields this output:

[('abc',)]
['[EMAIL PROTECTED]']
Closing remaining open files: foo... done

Do you have any idea what this is?  This is with numpy 1.0.1.
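In case it helps to narrow things down, here's a small diagnostic I can run
(same file and column names as the script above); printing the raw bytes
should show whether the 'junk' is padding after the first NUL or actual
corruption of the 'abc' bytes:

from tables import *

fp = openFile( "foo", 'r' )
table = fp.root.title

b = table.read()
cell = b['var1'][0]

# repr() exposes any trailing bytes that a plain print would garble.
print repr(cell)
# The length shows whether numpy returned the full 20-byte field
# or stopped at the first NUL.
print len(cell)

fp.close()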

thanks,
Stefan


----- Original Message ----
From: Francesc Altet <[EMAIL PROTECTED]>
To: Stefan Kuzminski <[EMAIL PROTECTED]>
Cc: PyTables user list <pytables-users@lists.sourceforge.net>
Sent: Monday, March 26, 2007 3:01:55 PM
Subject: Re: [Pytables-users] memory usage while appending tables (with sample code)

On Monday, 26 March 2007 at 18:17 +0200, Francesc Altet wrote:
> Well, it is not that easy. I was fooled by the strange behaviour of
> range() (in terms of memory consumption), but there exists a real
> problem that (I think) I've traced down to the HDF5 library (H5Dwrite
> function in particular). I'm going to study this more carefully and, if
> appropriate, report the problem to the HDF5 maintainers.
> 
> I'll come back with more info about this issue.

I've done a bit of research on this and here is the conclusion: the
growth in memory consumption is basically due to the growth of the
B-tree that HDF5 keeps in memory to accelerate data access (so it is
*not* a leak). From the HDF5 manual:

"""
HDF5 takes the data in bunches of chunksize length to write it on disk.
A B-tree in memory is used to map structures on disk. The more chunks
that are allocated for a dataset, the larger the B-tree. Large B-trees
take memory and cause file storage overhead as well as more disk I/O
and higher contention for the metadata cache.  You have to balance
between memory and I/O overhead (small B-trees) and time to access
data (big B-trees).
"""

So, I had completely forgotten about this (doh!). In fact, PyTables
already provides logic for computing the optimal chunksize for chunked
datasets, although you have to help it a bit: all of the constructors for
chunked datasets have an 'expectedrows' (or equivalent) parameter that
lets you pass an estimate of the final size of your dataset (a table in
this case).
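
For example, here is a minimal sketch of passing such an estimate to a
Table constructor (the file name, table name and row estimate are made
up for illustration):

from tables import *

fp = openFile( "estimate.h5", 'w' )

# Telling PyTables to expect ~1 million rows lets it choose a
# chunksize that balances seek time against B-tree memory use.
table = fp.createTable( fp.root, 'mytable',
                        { 'var1' : StringCol( itemsize=20) }, '',
                        expectedrows=1000000 )

fp.close()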

Unfortunately, in your example, providing such a guess doesn't do much
to reduce the memory growth. This is because, in these times of
machines with plenty of memory available, I prioritized the seek times
(i.e. the time to retrieve a particular row in the table) at the cost of
more memory consumption. Of course, if you don't like this tuning, you
can try 'fooling' PyTables by telling it that you have many more rows
than you really have. But, frankly, this is not a very elegant solution.

Fortunately, with the advent of PyTables 2.0 you will be able to set, in
a direct way, the chunksize (the 'chunkshape' parameter in constructors)
that best fits your problem. For example, if seek times are not very
important to you, but memory consumption is, then you can try enlarging
the chunksize of your dataset. Conversely, if the seek time is the most
important parameter, then you should reduce the chunksize. In your case,
try setting 'chunkshape' to (10,) and you will see that memory
consumption is greatly reduced (and writing speed improves as well).
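
For instance, a sketch along these lines (I'm reusing the table layout
from your test script; the file name is made up):

from tables import *

fp = openFile( "tuned.h5", 'w' )

# An explicit chunkshape overrides the chunksize that
# 'expectedrows' would otherwise select.
table = fp.createTable( fp.root, 'title',
                        { 'var1' : StringCol( itemsize=20) }, '',
                        chunkshape=(10,) )

fp.close()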

Of course, the 'chunkshape' parameter is only meant for expert users.
Other users should keep using 'expectedrows', which delivers a
reasonable seek-time/memory-consumption trade-off.

Well, let's hope that I'll remember the B-tree issue the next time
another user complains ;)

Cheers,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth







 

