Hi Pauli,

This is the answer from Quincey Koziol, one of the core developers of the HDF5 library, about the memory problem when updating many chunks at the same time. Can you try the latest version of the HDF5 1.8 series? It seems this problem would be much alleviated there. Remember to add the "--with-default-api-version=v16" flag when configuring the HDF5 library in order to be able to link with PyTables later on.
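For reference, a typical build sequence with that flag might look like this (the install prefix is just a placeholder, adjust to taste):

```shell
# From the unpacked HDF5 1.8.x source directory:
./configure --with-default-api-version=v16 --prefix=/usr/local/hdf5
make
make install
```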
Cheers,

---------- Forwarded message ----------
Subject: Re: Writing to a dataset with 'wrong' chunksize
Date: Tuesday 27 November 2007
From: Quincey Koziol <[EMAIL PROTECTED]>
To: Francesc Altet <[EMAIL PROTECTED]>

Hi Francesc,

On Nov 23, 2007, at 2:06 PM, Francesc Altet wrote:
> Hi,
>
> Some time ago, a PyTables user complained that the following simple
> operation was hogging gigantic amounts of memory:
>
> import tables, numpy
> N = 600
> f = tables.openFile('foo.h5', 'w')
> f.createCArray(f.root, 'huge_array',
>                tables.Float64Atom(),
>                shape=(2, 2, N, N, 50, 50))
> for i in xrange(50):
>     for j in xrange(50):
>         f.root.huge_array[:,:,:,:,j,i] = \
>             numpy.array([[1,0],[0,1]])[:,:,None,None]
>
> and I think that the problem could be on the HDF5 side.
>
> The point is that, for the 6-dimensional 'huge_array' dataset,
> PyTables computed an 'optimal' chunkshape of (1, 1, 1, 6, 50, 50).
> Then, the user wanted to update the array starting from the trailing
> dimensions (instead of using the leading ones, which is the
> recommended practice for C-ordered arrays). This results in PyTables
> asking HDF5 to do the update using the traditional procedure:
>
> /* Create a simple memory data space */
> if ( (mem_space_id = H5Screate_simple( rank, count, NULL )) < 0 )
>   return -3;
>
> /* Get the file data space */
> if ( (space_id = H5Dget_space( dataset_id )) < 0 )
>   return -4;
>
> /* Define a hyperslab in the dataset */
> if ( rank != 0 && H5Sselect_hyperslab( space_id, H5S_SELECT_SET,
>                                        start, step, count, NULL ) < 0 )
>   return -5;
>
> if ( H5Dwrite( dataset_id, type_id, mem_space_id, space_id,
>                H5P_DEFAULT, data ) < 0 )
>   return -6;
>
> While I understand that this approach is suboptimal (2*2*600*100 =
> 240,000 chunks have to be 'updated' for each update operation in the
> loop), I don't completely understand why the user reports that the
> script is consuming so much memory (the script crashes, but perhaps
> it is asking for several GB).
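To put numbers on "suboptimal", here is my own back-of-the-envelope sketch. The `chunks_touched` helper is hypothetical (not part of PyTables or HDF5); it just counts how many chunks a hyperslab intersects, which is what drives both the I/O cost and the worst-case memory if every affected chunk were held in core at once:

```python
# Hypothetical helper: count the chunks intersected by the
# hyperslab [start, stop) of a chunked dataset.
def chunks_touched(shape, chunkshape, start, stop):
    n = 1
    for s0, s1, c in zip(start, stop, chunkshape):
        # index of last chunk minus index of first chunk, plus one
        n *= (s1 - 1) // c - s0 // c + 1
    return n

shape      = (2, 2, 600, 600, 50, 50)
chunkshape = (1, 1, 1, 6, 50, 50)   # PyTables' computed chunkshape

# Trailing-dimension update, huge_array[:, :, :, :, j, i],
# for any single (j, i) pair:
trailing = chunks_touched(shape, chunkshape,
                          start=(0, 0, 0, 0, 0, 0),
                          stop=(2, 2, 600, 600, 1, 1))
print(trailing)   # 240000 chunks per assignment

# Leading-dimension update, huge_array[i, j, k, :, :, :]:
leading = chunks_touched(shape, chunkshape,
                         start=(0, 0, 0, 0, 0, 0),
                         stop=(1, 1, 1, 600, 50, 50))
print(leading)    # 100 chunks per assignment

# Worst case, if all affected chunks were loaded at once:
chunk_bytes = 1 * 1 * 1 * 6 * 50 * 50 * 8   # float64
print(trailing * chunk_bytes / 2**30)       # ~26.8 GiB
```

So each trailing-dimension assignment intersects 240,000 chunks (and each chunk gets revisited 2500 times over the full loop), while a leading-dimension traversal touches only 100 chunks per assignment, each written exactly once. That is why updating along the leading dimensions of a C-ordered array is the recommended practice.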
> My guess is that perhaps HDF5 is trying to load all the affected
> chunks in memory before trying to update them, but I thought it best
> to report this here just in case this is a bug, or, if not, in case
> the huge demand for memory can be somewhat alleviated.

Is this with the 1.6.x library code? If so, it would be worthwhile
checking with the 1.8.0 code, which is designed to do all the I/O on
each chunk at once and then proceed to the next chunk. However, it
does build information about the selection on each chunk to update,
and if the I/O operation will update 240,000 chunks, that could be a
large amount of memory...

	Quincey

> In case you need more information, you may find it by following the
> details of the discussion in this thread:
>
> http://www.mail-archive.com/pytables-users@lists.sourceforge.net/
> msg00722.html
>
> Thanks!
>
> --
> >0,0<   Francesc Altet     http://www.carabos.com/
> V V    Cárabos Coop. V.   Enjoy Data
>  "-"

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to [EMAIL PROTECTED]
To unsubscribe, send a message to [EMAIL PROTECTED]

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V    Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users