Hi Anthony,

thanks for the explanation and the links, it's much clearer now. So without compression a CArray is really a smarter type of sparse file, but you have to set a sensible chunk shape. Do you know how the default value is chosen, btw? I am asking because I didn't see any difference in performance between the default chunk shape and (1, N), where (N, N) is the shape of the matrix. I guess that the write performance depends crucially on the size of the I/O buffer, so the default must be picking a similar setting.
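For anyone following along, here is a minimal sketch of how one might check which chunk shape PyTables picks by default and set explicit ones next to it. The file location, the node names and N = 1000 are made up for illustration; only open_file(), create_carray(), Filters and the chunkshape attribute come from PyTables itself.

```python
import os
import tempfile

import numpy as np
import tables

N = 1000  # hypothetical matrix size, chosen only for illustration


def make_carray(h5, name, chunkshape=None, complevel=0):
    """Create an empty N x N float64 CArray under the root group."""
    return h5.create_carray(
        h5.root,
        name,
        atom=tables.Float64Atom(),
        shape=(N, N),
        chunkshape=chunkshape,  # None lets PyTables choose a default
        filters=tables.Filters(complevel=complevel, complib="zlib"),
    )


# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "chunks.h5")

with tables.open_file(path, "w") as h5:
    default = make_carray(h5, "default")
    rows = make_carray(h5, "rows", chunkshape=(1, N))
    packed = make_carray(h5, "packed", chunkshape=(1, 100), complevel=5)

    # Write a single row into each array; the other rows are never assigned.
    for arr in (default, rows, packed):
        arr[0, :] = np.arange(N, dtype=np.float64)

    # Record the chunk shape each node ended up with.
    chunkshapes = {
        arr.name: tuple(int(s) for s in arr.chunkshape)
        for arr in (default, rows, packed)
    }
    first_row = packed[0, :]

print(chunkshapes)
```

Comparing os.path.getsize(path) and write timings across different settings would be the natural next step, but the numbers depend heavily on the data and the machine, so none are claimed here.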
Anyway, I have played a bit with other values of the chunk shape in conjunction with the compression level, and using a chunk shape of (1, 100) with complevel=5 gives speeds that are only 10-15% slower than what I get with a chunk shape of (1, 1) and complevel=0. The resulting file is 10 times smaller, and something like 35 times smaller than a NPY sparse file, btw!

Thanks!

Giovanni

On 06/24/2013 05:25 AM, pytables-users-requ...@lists.sourceforge.net wrote:
> Hi Giovanni!
>
> I think that you may have some misunderstanding about how chunking works,
> which is leading you to get terrible performance. In fact what you
> describe is a great strategy (write all and zip) for using normal Arrays.
>
> However, chunking and CArrays don't work like this. If a chunk contains no
> data, it is not written at all! Also, all zipping takes place on the chunk
> level. Thus for very small chunks you can actually increase the file size
> and access time by using compression.
>
> For sparse matrices and CArrays, you need to play around with the
> chunkshape argument to create_carray() and compression. Performance is
> going to be affected by how dense the matrix is and how grouped it is. For
> example, for a very dense and randomly distributed matrix, chunkshape=1 and
> no compression is best. For block diagonal matrices, the chunkshape should
> be the nominal block shape. Compression is only useful here if the blocks
> all have similar values or the block shape is large. For example
>
> 1 1 0 0 0 0
> 1 1 0 0 0 0
> 0 0 1 1 0 0
> 0 0 1 1 0 0
> 0 0 0 0 1 1
> 0 0 0 0 1 1
>
> is well suited to a chunkshape=(2, 2)
>
> For more information on the HDF model please see my talk slides and video
> [1,2]. I hope this helps.
>
> Be Well
> Anthony
>
> PS.
> Glad to see you using the new API
>
> 1. https://github.com/scopatz/hdf5-is-for-lovers
> 2. http://www.youtube.com/watch?v=Nzx0HAd3FiI

--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciam...@indiana.edu