Hi Giovanni!
I think that you may have some misunderstanding about how chunking
works, which is leading to the terrible performance you are seeing. What
you describe (write everything, then zip) is a great strategy for normal
Arrays. However, chunking and CArrays don't work like this. If a chunk
contains no data, it is not written at all! Also, all zipping takes place
at the chunk level. Thus, for very small chunks, compression can actually
increase both the file size and the access time.
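To make this concrete, here is a minimal sketch (the file name, shapes,
and filter settings are illustrative, not taken from your code) showing
both effects: only the chunks you assign to are ever written, and
compression is applied chunk by chunk:

import tables

# Two CArrays with deliberately tiny (1, 1) chunks: one plain, one
# zlib-compressed.  Chunks that are never assigned to are never
# written to disk at all.
with tables.open_file("chunks.h5", mode="w") as f:
    atom = tables.Float64Atom()
    plain = f.create_carray(f.root, "plain", atom=atom,
                            shape=(1000, 1000), chunkshape=(1, 1))
    zipped = f.create_carray(f.root, "zipped", atom=atom,
                             shape=(1000, 1000), chunkshape=(1, 1),
                             filters=tables.Filters(complevel=5,
                                                    complib="zlib"))
    plain[0, 0] = 1.0   # exactly one tiny chunk lands on disk
    zipped[0, 0] = 1.0  # also one chunk, plus per-chunk zlib overhead

With chunks this small, every compressed chunk carries its own
bookkeeping, so the "zipped" array can easily end up larger and slower
than the plain one.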
For sparse matrices and CArrays, you need to play around with the
chunkshape argument to create_carray() and with compression. Performance
is going to be affected by how dense the matrix is and how clustered its
nonzero entries are. For example, for a very dense and randomly
distributed matrix, chunkshape=(1, 1) with no compression is best. For
block-diagonal matrices, the chunkshape should be the nominal block
shape. Compression is only useful here if the blocks all have similar
values or the block shape is large. For example, the matrix
1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 1 1 0 0
0 0 0 0 1 1
0 0 0 0 1 1
is well suited to chunkshape=(2, 2).
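For instance, here is a quick sketch of how you might store that matrix
(the file and node names are made up). With blocks this small I would
leave compression off, per the advice above:

import numpy as np
import tables

# The chunkshape matches the 2x2 block structure, so each block
# assignment touches exactly one chunk, and the all-zero off-diagonal
# chunks are never materialized on disk.
with tables.open_file("blockdiag.h5", mode="w") as f:
    carr = f.create_carray(f.root, "adj", atom=tables.Float64Atom(),
                           shape=(6, 6), chunkshape=(2, 2))
    block = np.ones((2, 2))
    for i in range(0, 6, 2):
        carr[i:i + 2, i:i + 2] = block  # one aligned chunk per write

For larger block shapes, or blocks whose values are all similar, adding
something like filters=tables.Filters(complevel=1, complib="zlib") to the
create_carray() call may be worth testing.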
For more information on the HDF5 data model, please see my talk slides
and video [1,2]. :) I hope this helps.
Be Well
Anthony
PS. Glad to see you using the new API ;)
[1] https://github.com/scopatz/hdf5-is-for-lovers
[2] http://www.youtube.com/watch?v=Nzx0HAd3FiI
On Sat, Jun 22, 2013 at 6:34 PM, Giovanni Luca Ciampaglia <glciamp...@gmail.com> wrote:
> Hi all,
>
> I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted
> to see if CArray was an appropriate solution for storing it. Right now I
> am using the NumPy binary format for storing the data in coordinate
> format and loading the matrix with Scipy's sparse coo_matrix class. As
> far as I understand, with CArray the matrix would be written in full
> (zeros included), but a) since it's chunked, accessing it does not load
> the whole thing into memory, and b) with compression enabled it would be
> possible to keep the size of the file reasonable.
>
> If my assumptions are correct, then here is my problem: writing the
> CArray to disk is extremely slow. I adapted the example
> from the documentation [1] and when I run the code on a 6000x6000 matrix
> with nnz = 17K I achieve a decent speed of roughly 4100 elements/s.
> However, when I try it on the full matrix the writing speed drops to 4
> elements/s. Am I doing something wrong? Any feedback would be greatly
> appreciated!
>
> Code: https://gist.github.com/junkieDolphin/5843064
>
> Cheers,
>
> Giovanni
>
> [1] http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-carray-class
>
> --
> Giovanni Luca Ciampaglia
>
> ☞ http://www.inf.usi.ch/phd/ciampaglia/
> ✆ (812) 287-3471
> ✉ glciamp...@gmail.com
>