----- Original Message ----
> From: FrancescAlted <fal...@pytables.org>
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Sent: Thu, March 24, 2011 1:24:55 AM
> Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this
>be a bug?
>
> On Wednesday 23 March 2011 19:53:43 Adriano Vilela Barbosa wrote:
> > >From: FrancescAlted <fal...@pytables.org>
> > >To: Discussion list for PyTables
> > ><pytables-users@lists.sourceforge.net> Sent: Wed, March 23, 2011
> > >10:57:06 AM
> > >Subject: Re: [Pytables-users] Problem writing strings to a CArray.
> > >Could this be
> > >
> > >a bug?
> > >
> > >
> > >2011/3/23 Adriano Vilela Barbosa <adriano.vil...@yahoo.com>
> > >
> > >This is not a bug, but rather a feature of NumPy. Look at this:
> > >>>>> import numpy as np
> > >>>>> a = np.array(['aa\x00\x00'])
> > >>>>> a[0]
> > >>
> > >>'aa' # hey! where have my trailing 0's gone?
> > >>
> > >>>>> a.data[:]
> > >>
> > >>'aa\x00\x00' # yeah, they still are in the data area of the array
> > >>
> > >>I'd recommend using the byte ('i1') type to achieve what you want:
> > >>>>> a.view('i1')
> > >>
> > >>array([97, 97, 0, 0], dtype=int8)
> > >>
> > >>Thank you very much for your explanation, but I still don't get it.
> > >>
> > >>Let's forget numpy for a moment and just say I want to store the
> > >>string 'aa\x00\x00' in a CArray. Each element of the CArray is a 4
> > >>element string.
> > >>
> > >>First, I create the CArray:
> > >>>>> import tables
> > >>>>> fid = tables.openFile("carray_test.hdf","w")
> > >>>>>
> > >>>>> fid.createGroup("/", 'table', 'Binary table')
> > >>>>> array_atom = tables.StringAtom(itemsize=4)
> > >>>>> array_shape = (1,)
> > >>>>>
> > >>>>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>
> > >>Now, I store the string 'aa\x00\x00' in the first row (which is the
> > >>only row in this example) of the CArray:
> > >>>>> fid.root.table.bin_table[0] = 'aa\x00\x00'
> > >>
> > >>Now, I do
> > >>
> > >>>>> fid.root.table.bin_table[0].data[:]
> > >>
> > >>'aa'
> > >>
> > >>So, it looks to me that the trailing \x00 elements of the string
> > >>are not being stored in the CArray. From my side, there's no numpy
> > >>involved; I'm just trying to store a string. What am I missing?
> >
> > You cannot avoid NumPy because PyTables uses NumPy behind the scenes
> > as an intermediate buffer area. What you are seeing is probably a
> > side effect of the 'feature' I mentioned before. Any
> > reason why you don't want to use a byte type instead of a string?
> >
> > >FrancescAlted
> >
> > Hi again,
> >
> > I'm happy to use bytes instead of strings. The reason I was using
> > strings is that, as someone new to Python and numpy, I thought
> > strings were the only way of dealing with individual bytes. Also,
> > because of this problem I'm having with strings, I tried storing the
> > numpy arrays directly into the HDF file, but the performance was
> > much poorer and the file size much bigger.
> >
> > So, going back to my previous example, I guess the only things I need
> > to change are the Atom object used to construct the CArray and the
> > numpy array method I call: view() instead of tostring().
> >
> > >>> import numpy
> > >>> import tables
> > >>> fid = tables.openFile("carray_test.hdf","w")
> > >>> fid.createGroup("/", 'table', 'Binary table')
> > >>> array_atom = tables.Atom.from_dtype(numpy.dtype((numpy.byte, (4,))))
> > >>> array_shape = (1,)
> > >>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>> a = numpy.array(['aa\x00\x00'])
> > >>> fid.root.table.bin_table[0] = a.view('b')
> > >>> fid.root.table.bin_table[0].data[:]
> >
> > 'aa\x00\x00'
>
> Ah! I see where the problem was now (the assignment). Thanks for
> pointing that out.
>
> > Is this right, or is there a more efficient way of doing it?
>
> Well, I don't fully understand why you are converting to strings prior
> to saving the data into the CArray, because it should support compression
> and be pretty fast too. Are you really getting a significant speed-up
> by converting to strings?
>
> If you want to continue down the conversion path, I'd also try a VLArray
> whose elements have been previously compressed using the blosc
> package (https://github.com/FrancescAlted/python-blosc).
>
> Good luck!
>
> --
> FrancescAlted
>
So, here's the long story. I have been using OpenCV
(http://opencv.willowgarage.com/wiki/) to compute optical flow from video
sequences. The videos I work with are typically 720 x 480 pixels @ 30 fps, but
this can be as high as 1280 x 720 pixels @ 60 fps. For each pair of consecutive
frames, optical flow creates two "matrices" the same size as the video frame,
with the horizontal and vertical pixel displacements. I say "matrices" because
these results are actually stored as IplImages
(http://opencv.willowgarage.com/documentation/python/core_basic_structures.html#index-5356).
They have a method tostring() which gives you the "image" contents as a byte
stream. My earliest pieces of code simply used this to store the optical flow
results to a binary file, and this works just fine. However, the resulting
files are huge (hundreds of gigabytes, depending on the video duration). So,
being able to use compression was one of my initial motivations for using
PyTables. On the other hand, I have a piece of Matlab code that reads the
optical flow binary files and does lots of processing on the results.

So, my reason for using strings is that that's what OpenCV gives you. However,
they've been moving towards NumPy in their later releases, and I'll have a
look into that.
Because of the problems I'm having with strings, I did try to store the optical
flow results as numpy arrays. That involves converting the IplImages into numpy
arrays prior to storing them in the HDF file. I did this by passing the output
of IplImage.tostring() to numpy.fromstring(), but I believe more recent
versions of OpenCV have numpy support built in. Once I converted the IplImages
to numpy arrays, I tried to store them in a CArray with as many elements as
there are flow matrices, and with an atom with the same dimensions as the
optical flow frame (e.g., 720 x 480). When I try to create the second CArray (I
need two: one for the horizontal and the other for the vertical component of
the optical flow vectors), PyTables throws a memory error. Because of this
error, I tried creating CArrays whose dimensions are [n_rows n_cols n_frames]
with an atom of a single 32-bit float (optical flow values are represented by
32-bit floats). This works, but it's dozens of times slower than storing
strings directly, and the output file size is usually 50% larger (which makes
a big difference when we're dealing with files hundreds of gigabytes in size).
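For reference, the byte-stream-to-array conversion described above can be sketched roughly as follows. This is only an illustration under assumptions: a small NumPy array stands in for the IplImage (so no OpenCV is needed), the helper name frame_bytes_to_array is made up for this example, and numpy.frombuffer is used as the current equivalent of calling numpy.fromstring() on binary data.

```python
import numpy as np

def frame_bytes_to_array(raw: bytes, n_rows: int, n_cols: int) -> np.ndarray:
    """Interpret a raw byte stream as an (n_rows, n_cols) array of 32-bit floats.

    Hypothetical helper: `raw` plays the role of the tostring() output.
    """
    flat = np.frombuffer(raw, dtype=np.float32)
    if flat.size != n_rows * n_cols:
        raise ValueError("byte stream does not match the expected frame size")
    return flat.reshape(n_rows, n_cols)

# Stand-in for one flow component of a tiny 2 x 3 "frame".
frame = np.arange(6, dtype=np.float32).reshape(2, 3)
raw = frame.tobytes()                      # analogous to the tostring() output
restored = frame_bytes_to_array(raw, 2, 3)
```

The array returned by frombuffer is a read-only view of the byte string; call .copy() on it if it needs to be modified.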
I was happily using

fid.root.table.bin_table[frame_number] = flow_frame.tostring()

to store the flow frames in the CArray until I ran into an optical flow frame
whose string representation ended with '\x00'. In the end, I may simply decide
to keep using the strings and append the missing '\x00' values to them when
reading the files. As I know how long the strings should be, I can tell how
many '\x00' values are missing, if any. Not ideal, but it may be the easiest
thing to do.
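A minimal sketch of that padding workaround, with a fixed-width NumPy string array standing in for the value read back from the CArray. The helper name restore_trailing_nulls and the 4-byte example are assumptions for illustration only; they are not part of PyTables or OpenCV.

```python
import numpy as np

def restore_trailing_nulls(stored: bytes, expected_len: int) -> bytes:
    """Right-pad a byte string with NUL bytes until it is expected_len long."""
    if len(stored) > expected_len:
        raise ValueError("stored value is longer than the expected length")
    return stored.ljust(expected_len, b"\x00")

# A fixed-width NumPy string array drops trailing NULs on element access.
original = b"aa\x00\x00"
arr = np.array([original])        # 4-byte fixed-width string dtype
truncated = bytes(arr[0])         # the two trailing NULs are gone here
recovered = restore_trailing_nulls(truncated, len(original))
```

This only works because the expected length of every frame is known in advance. NUL bytes in the middle of a string are not affected by the truncation, so only the tail needs restoring.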
I'll also have a look at the blosc package. Thank you for showing me that.

Thanks a lot for your help. I really appreciate it.

Adriano