Thanks for looking into this (no complaints here!). Indeed, using the 'chunkshape' parameter with PyTables 2.0 greatly improved the memory usage and overall speed.
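For reference, here is roughly what my appending loop looks like now (the column layout, file name, and row count are placeholders, not my real schema):

    from tables import openFile, IsDescription, Float64Col

    class Row(IsDescription):
        var1 = Float64Col()

    fp = openFile('data.h5', 'w')
    # An explicit chunkshape keeps the in-memory HDF5 B-tree small
    # while appending; (10,) is just the value suggested below.
    table = fp.createTable(fp.root, 'data', Row, '', chunkshape=(10,))
    row = table.row
    for i in xrange(1000000):
        row['var1'] = float(i)
        row.append()
    table.flush()
    fp.close()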
However... switching to PyTables 2.0 brings a few wrinkles. Specifically, there seems to be something about numpy string handling that yields extra 'junk' at the end of the string when the string column is referenced out. This code:

    from tables import *

    fp = openFile('foo', 'w')
    table = fp.createTable(fp.root, 'title',
                           {'var1': StringCol(itemsize=20)}, '')
    table.append([['abc']])
    fp.flush()
    b = table.read()
    print b
    print b['var1']

yields this output:

    [('abc',)]
    ['[EMAIL PROTECTED]']
    Closing remaining open files: foo... done

Do you have any idea what this is? This is with numpy 1.0.1.

thanks,
Stefan

----- Original Message ----
From: Francesc Altet <[EMAIL PROTECTED]>
To: Stefan Kuzminski <[EMAIL PROTECTED]>
Cc: PyTables user list <pytables-users@lists.sourceforge.net>
Sent: Monday, March 26, 2007 3:01:55 PM
Subject: Re: [Pytables-users] memory usage while appending tables (with sample code)

On Monday, 26 March 2007 at 18:17 +0200, Francesc Altet wrote:
> Well, it is not that easy. I was fooled by the strange behaviour of
> range() (in terms of memory consumption), but there exists a real
> problem that (I think) I've tracked down to the HDF5 library (the
> H5Dwrite function in particular). I'm going to study this more
> carefully and, if appropriate, report the problem to the HDF5
> maintainers.
>
> I'll come back with more info about this issue.

I've done my small bit of research on this, and here is the conclusion: the growth in memory consumption is basically due to the growth of the B-tree that HDF5 keeps in memory for accelerating data access (so it is *not* a leak). From the HDF5 manual:

"""
HDF5 takes the data in bunches of chunksize length to write them to disk. A B-tree in memory is used to map structures on disk. The more chunks that are allocated for a dataset, the larger the B-tree. Large B-trees take memory and cause file storage overhead, as well as more disk I/O and higher contention for the metadata cache. You have to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).
"""

So, I had completely forgotten about this (doh!). In fact, PyTables already provides logic for computing the optimal chunksize for chunked datasets, although you should help it a bit: all of the constructors for chunked datasets have an 'expectedrows' (or equivalent) parameter that lets you pass an estimate of the final size of your dataset (a table, in this case).

Unfortunately, in your example, providing such a guess doesn't help much to reduce the memory growth. This is because, in these times of machines with plenty of memory available, I prioritized seek times (i.e. the time to retrieve a particular row of the table) at the cost of more memory consumption. Of course, if you don't like this tuning, you can try 'fooling' PyTables by telling it that you have many more rows than you really have. But, frankly, this is not a very elegant solution.

Fortunately, with the advent of PyTables 2.0 you will be able to set the chunksize directly (the 'chunkshape' parameter in constructors) to whatever best fits your problem. For example, if seek times are not very important to you, but memory consumption is, then you can try enlarging the chunksize of your dataset. Conversely, if seek time is the most important parameter, then you should reduce the chunksize. In your case, try setting the 'chunkshape' to (10,) to see how the memory consumption is greatly reduced (and the writing speed improved).
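To make the two knobs concrete, here is a small sketch (the table description, file name, and row counts are made up for illustration):

    from tables import openFile, IsDescription, Float64Col

    class Particle(IsDescription):
        value = Float64Col()

    fp = openFile('tuning.h5', 'w')

    # Regular route: pass an estimate of the final table size and let
    # PyTables compute a chunkshape with a balanced seek-time/memory
    # trade-off.
    t1 = fp.createTable(fp.root, 'estimated', Particle, '',
                        expectedrows=10*1000*1000)

    # Expert route: set the chunkshape directly.  Fewer, bigger chunks
    # mean a smaller in-memory B-tree (less memory) but slower random
    # seeks; more, smaller chunks invert the trade-off.
    t2 = fp.createTable(fp.root, 'explicit', Particle, '',
                        chunkshape=(10,))

    print t1.chunkshape, t2.chunkshape  # compare what PyTables picked
    fp.close()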
Of course, the use of such a 'chunkshape' parameter is only meant for expert users. Other users should keep using 'expectedrows', which delivers reasonable seek-time/memory-consumption ratios.

Well, let's hope I'll remember the B-tree issue the next time another user complains ;)

Cheers,

--
Francesc Altet     |  Be careful about using the following code --
Carabos Coop. V.   |  I've only proven that it works,
www.carabos.com    |  I haven't tested it.  -- Donald Knuth