----- Original Message ----
> From: Francesc Alted <fal...@pytables.org>
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Sent: Sat, March 26, 2011 7:00:44 AM
> Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?
>
> On Friday 25 March 2011 21:12:50 Adriano Vilela Barbosa wrote:
> > > Probably not, but as I said before, trying to pack binary data as
> > > strings is asking for problems. Please use a bytes array instead.
> > > If what you are after is performance, then I'd say that
> > > Blosc/VLArray is the way to go.
> >
> > I understand. As I said before, I was using strings because that's
> > what the OpenCV Python bindings use to represent image data (though
> > they've been moving towards numpy in their latest releases).
> > Actually, representing byte streams as strings seems to be the
> > standard in Python 2.x, which was kind of surprising to me when I
> > first started programming in Python.
>
> Exactly, and this is why the Python crew has introduced the bytearray
> object in Python 2.6. See more info on this in:
>
> http://docs.python.org/whatsnew/2.6.html#pep-3112-byte-literals

Yes. I had read a little bit about bytearrays in Python 2.6. Thanks for the link anyway.

> > > Could you send a self-contained example reproducing your problem?
> >
> > Please, see the code below.
>
> Okay. The problem was twofold. First of all, a bug in the way
> PyTables deals with the defaults caused the MemoryError (this has been
> fixed in trunk). Secondly, and due to HDF5 limitations, you cannot use
> atoms that are larger than 64 KB. The canonical way to handle this is
> to add more dimensions to the datasets in HDF5 and then use the slice
> selection capabilities to retrieve the images. Look at this:

Actually, what you did below was the first thing I tried when moving away from strings. However, it resulted in my code running dozens of times slower and my HDF files being considerably bigger. That's why I tried using bigger atoms (one atom per optical flow frame), to see if this would run faster and/or produce smaller files, and then I ran into the error I reported.

However, I later noticed that the shape of your array is

    array_shape = (n_frames, n_rows, n_cols)

whereas I had tried

    array_shape = (n_rows, n_cols, n_frames)

This makes a huge difference. Using a shape (n_frames, n_rows, n_cols) for the CArray results in the code running only about 15% slower and producing a file only about 10% bigger when compared to using strings. This is much better than the results I was getting when using a shape (n_rows, n_cols, n_frames). I guess this has to do with the way the data is laid out on disk?

As for the atom size limit (64 kB), I guess that doesn't apply to string atoms? When using strings, I construct the atom in the following way

    array_atom = tables.StringAtom(len(matrix.tostring()))

where len(matrix.tostring()) = 691200 bytes = 675 kB. I mean, the size of the string atom is well above the 64 kB limit and yet it doesn't produce any errors.

Thanks a lot for your help.
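
On the layout question above, here is a minimal sketch of why the axis order might matter. It assumes plain row-major (C) ordering of the elements and deliberately ignores the chunking and compression that a CArray also applies, so it only illustrates the idea and does not model what PyTables or HDF5 actually do on disk. The helper names flat_offset and frame_offsets are made up for this example and are not part of any library.

```python
# Sketch: where the elements of one frame sit in a flat, row-major buffer.
# Assumption: plain C ordering; HDF5 chunking and compression are NOT modelled.

def flat_offset(index, shape):
    """Return the row-major element offset of a multi-dimensional index."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


n_rows, n_cols, n_frames = 480, 720, 2


def frame_offsets(frame, frames_first, count=4):
    """Offsets of the first `count` elements of row 0 of the given frame."""
    offsets = []
    for col in range(count):
        if frames_first:
            # Shape (n_frames, n_rows, n_cols)
            offsets.append(
                flat_offset((frame, 0, col), (n_frames, n_rows, n_cols)))
        else:
            # Shape (n_rows, n_cols, n_frames)
            offsets.append(
                flat_offset((0, col, frame), (n_rows, n_cols, n_frames)))
    return offsets


offsets_frames_first = frame_offsets(1, frames_first=True)
offsets_frames_last = frame_offsets(1, frames_first=False)

# Size of one frame stored as 16-bit integers (2 bytes per element)
frame_bytes = n_rows * n_cols * 2

print("frames first: %s" % offsets_frames_first)
print("frames last:  %s" % offsets_frames_last)
print("bytes per frame: %d (%d kB)" % (frame_bytes, frame_bytes // 1024))
```

With the frame index first, the elements of one frame are neighbours in the buffer (offsets 345600, 345601, ...), whereas with the frame index last they are interleaved with the other frame (offsets 1, 3, 5, ...). Under the stated assumption, writing a whole frame touches one contiguous block in the first case and is scattered across the entire buffer in the second.
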
Adriano

> import tables
> import numpy
> from time import time
>
> # ----- Writing data to file ----- #
>
> # Open the output file for writing
> fid = tables.openFile("carray_error.hdf","w")
>
> # Create a table group
> fid.createGroup("/", 'table', 'Flow table')
>
> # The number of rows and columns in a frame, and the number of frames
> n_rows = 480
> n_cols = 720
> n_frames = 2
>
> # Create a numpy vector to be stored in the Carray
> matrix = numpy.random.randn(n_rows,n_cols)
>
> # The CArray shape
> array_shape = (n_frames, n_rows, n_cols)
>
> # The CArray atom
> array_atom = tables.Int16Atom()
>
> # Create a Carray for holding horizontal flow values
> fid.createCArray(fid.root.table,'flow_x',array_atom,array_shape)
>
> # Create a Carray for holding vertical flow values. This is where we
> # get an error; working with smaller values of n_rows and n_cols works
> # fine though.
> fid.createCArray(fid.root.table,'flow_y',array_atom,array_shape)
>
> t0 = time()
> for m in range(n_frames):
>     # Store one frame per index along the first axis
>     fid.root.table.flow_x[m] = matrix
>     fid.root.table.flow_y[m] = matrix
> print "time to save a couple of matrices:", round(time()-t0, 3)
>
> # ----- Reading data from file ----- #
>
> print "flow_x:", fid.root.table.flow_x[0]
> print "flow_y:", fid.root.table.flow_y[0]
>
> # Close the output file
> fid.close()
>
> And the output:
>
> time to save a couple of matrices: 0.004
> flow_x: [[ 0 0 0 ..., 0 1 0]
>  [ 1 0 0 ..., 0 0 0]
>  [ 1 0 0 ..., 0 0 0]
>  ...,
>  [ 1 2 -1 ..., -1 0 1]
>  [ 2 0 -1 ..., 0 0 -1]
>  [-1 1 0 ..., -1 0 0]]
> flow_y: [[ 0 0 0 ..., 0 1 0]
>  [ 1 0 0 ..., 0 0 0]
>  [ 1 0 0 ..., 0 0 0]
>  ...,
>  [ 1 2 -1 ..., -1 0 1]
>  [ 2 0 -1 ..., 0 0 -1]
>  [-1 1 0 ..., -1 0 0]]
>
> Hope this helps,
>
> --
> Francesc Alted
>
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users