On Monday 13 December 2010 15:08:03, Francesc Alted wrote:
> On Monday 13 December 2010 14:56:26, Dominik Szczerba wrote:
> > > But to know whether accessing columns is efficient for your
> > > case, I'd need more info on your datasets. Are they contiguous
> > > or chunked? If chunked, which chunkshape have you chosen?
> >
> > Both. Files saved from Matlab are uncompressed/contiguous; the ones
> > saved from my program are usually compressed/chunked, and the size
> > is around 1024^2/sizeof(type).
>
> Well, for PyTables (or any C application) and contiguous datasets,
> accessing data by columns is inefficient: the privileged direction
> for performance is rows.
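[Editor's note: a minimal NumPy sketch, not part of the original thread, illustrating why row access is privileged in C (row-major) order: a row is a contiguous view, while a column is strided by the full row length. The array shape below is arbitrary.]

```python
import numpy as np

# In C (row-major) order, the elements of one row sit next to each
# other in memory; the elements of one column are separated by a
# stride equal to the row length times the item size.
a = np.empty((4, 1000))

row = a[0]      # contiguous view of 1000 adjacent elements
col = a[:, 0]   # strided view: one element every 1000 * 8 bytes

print(row.flags['C_CONTIGUOUS'])                   # True
print(col.flags['C_CONTIGUOUS'])                   # False
print(col.strides[0] == a.shape[1] * a.itemsize)   # True
```

The same layout argument applies to a contiguous HDF5 dataset read from C: reading a row touches one run of bytes, reading a column touches one byte run per row.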
I was curious to see the difference in performance. Here are some timings:

>>> nptetra = np.empty((4, 4622544))
>>> f = tb.openFile("/tmp/t.h5", "w")
>>> tetra = f.createArray(f.root, "tetra", nptetra)
>>> %time [ tetra[:,i] for i in range(4622544) ]
CPU times: user 201.61 s, sys: 162.59 s, total: 364.20 s
Wall time: 367.91 s

Using the transposed version (i.e. accessing by rows):

>>> tetra2 = f.createArray(f.root, "tetra2", nptetra.transpose())
>>> %time [ tetra2[i] for i in range(4622544) ]
CPU times: user 163.78 s, sys: 0.48 s, total: 164.25 s
Wall time: 165.44 s   # more than 2x faster

But using the iterator is the fastest mode (the I/O is buffered):

>>> %time [ row for row in tetra2 ]
CPU times: user 26.21 s, sys: 0.38 s, total: 26.59 s
Wall time: 26.81 s

I'd say that for chunked datasets you can expect something similar.

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users