Hi Giovanni! I think that you may have some misunderstanding about how chunking works, which is leading you to get terrible performance. In fact, what you describe (write everything and zip) is a great strategy for normal Arrays.

However, chunking and CArrays don't work like this. If a chunk contains no data, it is not written at all! Also, all zipping takes place at the chunk level. Thus for very small chunks you can actually increase the file size and access time by using compression.

For sparse matrices and CArrays, you need to play around with the chunkshape argument to create_carray() and with compression. Performance is going to be affected by how dense the matrix is and how grouped it is. For example, for a very dense and randomly distributed matrix, chunkshape=1 and no compression is best. For block diagonal matrices, the chunkshape should be the nominal block shape. Compression is only useful here if the blocks all have similar values or the block shape is large. For example

    1 1 0 0 0 0
    1 1 0 0 0 0
    0 0 1 1 0 0
    0 0 1 1 0 0
    0 0 0 0 1 1
    0 0 0 0 1 1

is well suited to a chunkshape=(2, 2).

For more information on the HDF model please see my talk slides and video :) [1,2] I hope this helps.

Be Well
Anthony

PS. Glad to see you using the new API ;)

1. https://github.com/scopatz/hdf5-is-for-lovers
2. http://www.youtube.com/watch?v=Nzx0HAd3FiI

On Sat, Jun 22, 2013 at 6:34 PM, Giovanni Luca Ciampaglia <glciamp...@gmail.com> wrote:

> Hi all,
>
> I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted
> to see if CArray was an appropriate solution for storing it. Right now I
> am using the NumPy binary format for storing the data in coordinate
> format and loading the matrix with Scipy's sparse coo_matrix class. As
> far as I understand, with CArray the matrix would be written in full
> (zeros included) but a) since it's chunked, accessing it does not take
> memory and b) with compression enabled it would be possible to keep the
> size of the file reasonable.
>
> If my assumptions are correct, then here is my problem: I am running
> into problems when writing the CArray to disk. I adapted the example
> from the documentation [1] and when I run the code on a 6000x6000 matrix
> with nnz = 17K I achieve a decent speed of roughly 4100 elements/s.
> However, when I try it on the full matrix the writing speed drops to 4
> elements/s. Am I doing something wrong? Any feedback would be greatly
> appreciated!
>
> Code: https://gist.github.com/junkieDolphin/5843064
>
> Cheers,
>
> Giovanni
>
> [1] http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-carray-class
>
> --
> Giovanni Luca Ciampaglia
>
> ☞ http://www.inf.usi.ch/phd/ciampaglia/
> ✆ (812) 287-3471
> ✉ glciamp...@gmail.com
>
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
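
To make the chunk-level point from the reply concrete, here is a rough toy model in plain Python. It does not touch HDF5 or PyTables at all: it only counts how many chunks of a given shape would contain at least one nonzero value, on the assumption (taken from the reply) that chunks with no data are not written. The names nonempty_chunks, stored_cells and BLOCK_DIAGONAL are made up for this sketch and are not part of any library.

```python
# Toy model only: counts which chunks would hold data. No HDF5 is involved,
# and all names here are hypothetical helpers for illustration.

def nonempty_chunks(matrix, chunkshape):
    """Return the number of chunks that contain at least one nonzero value."""
    rows, cols = len(matrix), len(matrix[0])
    crows, ccols = chunkshape
    count = 0
    # Walk over the top-left corner of every chunk in the grid.
    for r0 in range(0, rows, crows):
        for c0 in range(0, cols, ccols):
            block = [matrix[r][c]
                     for r in range(r0, min(r0 + crows, rows))
                     for c in range(c0, min(c0 + ccols, cols))]
            if any(block):
                count += 1
    return count


def stored_cells(matrix, chunkshape):
    """Cells occupied by non-empty chunks, before any compression."""
    return nonempty_chunks(matrix, chunkshape) * chunkshape[0] * chunkshape[1]


# The 6x6 block diagonal matrix from the example in the reply.
BLOCK_DIAGONAL = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

if __name__ == "__main__":
    for shape in [(1, 1), (2, 2), (3, 3), (6, 6)]:
        print(shape,
              nonempty_chunks(BLOCK_DIAGONAL, shape),
              stored_cells(BLOCK_DIAGONAL, shape))
```

In this model, chunkshape (2, 2) covers the matrix with 3 non-empty chunks and 12 stored cells, while (3, 3) needs 4 chunks and 36 cells because every chunk straddles a block boundary. A chunkshape of (1, 1) also stores 12 cells, but spread over 12 separate chunks. How those counts translate into file size and speed on a real file depends on per-chunk overhead and compression, which this sketch does not try to model.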

