----- Original Message ----
> From: FrancescAlted <fal...@pytables.org>
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Sent: Thu, March 24, 2011 1:24:55 AM
> Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?
> 
> A Wednesday 23 March 2011 19:53:43 Adriano Vilela Barbosa escrigué:
> > >From: FrancescAlted <fal...@pytables.org>
> > >To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> > >Sent: Wed, March 23, 2011 10:57:06 AM
> > >Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?
> > >
> > >
> >  >2011/3/23 Adriano Vilela Barbosa <adriano.vil...@yahoo.com>
> >  >
> > >This is not a bug, but rather a feature of NumPy.  Look at this:
> > >>>>> import numpy as np
> > >>>>> a = np.array(['aa\x00\x00'])
> > >>>>> a[0]
> > >>
> > >>'aa'          # hey! where have my trailing 0's gone?
> > >>
> > >>>>> a.data[:]
> > >>
> > >>'aa\x00\x00'  # yeah, they are still in the data area of the array
> > >>
> > >>I'd recommend using the byte ('i1') type to achieve what you want:
> > >>>>> a.view('i1')
> > >>
> > >>array([97, 97,  0,  0], dtype=int8)
> > >>
> > >>Thank you very much for your explanation, but I still don't get it.
> > >>
> > >>Let's forget numpy for a moment and just say I want to store the
> > >>string 'aa\x00\x00' in a CArray. Each element of the CArray is a
> > >>4-element string.
> > >>
> > >>First, I create the CArray:
> > >>>>> import tables
> > >>>>> fid = tables.openFile("carray_test.hdf","w")
> > >>>>> fid.createGroup("/", 'table', 'Binary table')
> > >>>>> array_atom = tables.StringAtom(itemsize=4)
> > >>>>> array_shape = (1,)
> > >>>>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>
> > >>Now, I store the string 'aa\x00\x00' in the first row (which is the
> > >>only row in this example) of the CArray:
> > >>>>> fid.root.table.bin_table[0] = 'aa\x00\x00'
> > >>
> > >>Now, I do
> > >>
> > >>>>> fid.root.table.bin_table[0].data[:]
> > >>
> > >>'aa'
> > >>
> > >>So, it looks to me that the trailing \x00 elements of the string
> > >>are not being stored in the CArray. From my side, there's no numpy
> > >>involved; I'm just trying to store a string. What am I missing?
> > 
> > You cannot avoid NumPy because PyTables uses NumPy behind the scenes
> > as an intermediate buffer area.  What you are seeing is probably a
> > side effect of the 'feature' I mentioned before.  Any
> > reason why you don't want to use a byte type instead of a string?
> > 
> > >FrancescAlted
> > 
> > Hi again,
> > 
> > I'm happy to use bytes instead of strings. The reason I was using
> > strings is that, as someone new to Python and numpy, I thought
> > strings were the only way of dealing with individual bytes. Also,
> > because of this problem I'm having with strings, I tried storing the
> > numpy arrays directly into the HDF file, but the performance was
> > much worse and the file size much larger.
> > 
> > So, going back to my previous example, I guess the only things I need
> > to change are the Atom object used to construct the CArray and the
> > numpy array method I call: view() instead of tostring().
> > 
> > >>> import numpy
> > >>> import tables
> > >>> fid = tables.openFile("carray_test.hdf","w")
> > >>> fid.createGroup("/", 'table', 'Binary table')
> > >>> array_atom = tables.Atom.from_dtype(numpy.dtype((numpy.byte, (4,))))
> > >>> array_shape = (1,)
> > >>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>> a = numpy.array(['aa\x00\x00'])
> > >>> fid.root.table.bin_table[0] = a.view('b')
> > >>> fid.root.table.bin_table[0].data[:]
> > 
> > 'aa\x00\x00'
> 
> Ah!  I see where the problem was now (the assignment).  Thanks for 
> pointing it out.
> 
> > Is this right, or is there a more efficient way of doing it?
> 
> Well, I don't fully understand why you are converting to strings before 
> saving the info into the CArray, because it should support compression 
> and be pretty fast too.  Are you really getting a significant speed-up 
> by converting to strings?
> 
> In case you want to continue the conversion path, I'd also try a VLArray
> where the elements have been previously compressed using the blosc
> package (https://github.com/FrancescAlted/python-blosc).
> 
> Good  luck!
> 
> -- 
> FrancescAlted
> 

So, here's the long story. I have been using OpenCV 
(http://opencv.willowgarage.com/wiki/) to compute optical flow from video 
sequences. The videos I work with are typically 720 x 480 pixels @ 30 fps, but 
this can be as high as 1280 x 720 pixels @ 60 fps. For each pair of consecutive 
frames, optical flow creates two "matrices" the same size as the video frame, 
with the horizontal and vertical pixel displacements. I said "matrices" because 
these results are actually stored as IplImages 
(http://opencv.willowgarage.com/documentation/python/core_basic_structures.html#index-5356).
They have a method tostring() which gives you the "image" contents as a byte 
stream. My earliest pieces of code simply used this to store the optical flow 
results to a binary file, and this works just fine. However, the resulting 
files are huge (hundreds of gigabytes, depending on the video duration). So, 
being able to use compression was one of my initial motivations for using 
PyTables. On the other hand, I have a piece of Matlab code that reads the 
optical flow binary files and does lots of processing on the results. So, my 
reason for using strings is that that's what OpenCV gives you. However, they've 
been moving towards NumPy in their later releases, and I'll have a look into 
that.
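
As a rough check on those file sizes, here is a small back-of-the-envelope sketch. It assumes two flow matrices per frame pair, each holding one 32-bit float per pixel (the float size is the one mentioned further down in this message), and it ignores the fact that n frames give only n - 1 frame pairs. The function name raw_flow_bytes is just for illustration.

```python
def raw_flow_bytes(width, height, fps, seconds,
                   components=2, bytes_per_value=4):
    """Uncompressed size, in bytes, of the optical flow for a video.

    Assumes `components` matrices per frame pair (horizontal and vertical),
    one value of `bytes_per_value` bytes per pixel, and one frame pair per
    frame (off by one frame, which is negligible here).
    """
    return width * height * components * bytes_per_value * fps * seconds


# One hour of video at the two resolutions mentioned above.
one_hour_sd = raw_flow_bytes(720, 480, 30, 3600)
one_hour_hd = raw_flow_bytes(1280, 720, 60, 3600)

print(one_hour_sd / 1e9)  # roughly 298.6 (GB)
print(one_hour_hd / 1e9)  # roughly 1592.5 (GB)
```

So under these assumptions even one hour of 720 x 480 video lands in the hundreds of gigabytes before compression.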

Because of the problems I'm having with strings, I did try to store the optical 
flow results as numpy arrays. That involves converting the IplImages into numpy 
arrays prior to storing them in the HDF file. I did this by passing the output 
of IplImage.tostring() to numpy.fromstring(), but I believe more recent 
versions of OpenCV have numpy support built in. Once I converted the IplImages 
to numpy arrays, I tried to store them in a CArray with as many elements as 
there are flow matrices, and with an atom with the same dimensions as the 
optical flow frame (e.g., 720 x 480). When I try to create the second CArray (I 
need two; one for the horizontal and the other for the vertical component of 
the optical flow vectors) PyTables throws a memory error. Because of this 
error, I tried creating CArrays whose dimensions are [n_rows n_cols n_frames] 
with an atom of a single 32-bit float (optical flow values are represented by 
32-bit floats). This works, but it's dozens of times slower than storing 
strings directly, and the output file size is usually 50% larger (which makes a 
big difference when we're dealing with files that are hundreds of gigabytes in 
size).
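
For reference, the byte-stream-to-array conversion described above can be sketched as follows. This is only an illustration: it uses numpy.frombuffer (the call NumPy currently recommends for binary input in place of numpy.fromstring), assumes the byte stream holds float32 values in native byte order, and uses a tiny made-up frame instead of a real IplImage. The function name flow_bytes_to_array is hypothetical.

```python
import numpy as np


def flow_bytes_to_array(raw, n_rows, n_cols):
    """Decode a raw float32 byte stream into an (n_rows, n_cols) array.

    Assumes native byte order and exactly n_rows * n_cols values.
    """
    flat = np.frombuffer(raw, dtype=np.float32)
    if flat.size != n_rows * n_cols:
        raise ValueError("byte stream does not match the frame size")
    return flat.reshape(n_rows, n_cols)


# Stand-in for one flow matrix; the last value is 0.0, so the byte
# stream ends in zero bytes, the same situation that trips up strings.
frame = np.zeros((2, 3), dtype=np.float32)
frame[0, 0] = 1.5
raw = frame.tobytes()

decoded = flow_bytes_to_array(raw, 2, 3)
```

Since the values are handled as numbers rather than as a string, the zero bytes at the end of the stream survive the round trip.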

I was happily using

fid.root.table.bin_table[frame_number] = flow_frame.tostring()

to store the flow frames in the CArray until I ran into an optical flow frame 
whose string representation ended with '\x00'. In the end, I may simply decide 
to keep using the strings and append the missing '\x00' values to them when 
reading the files. As I know how long the strings should be, I can tell how 
many '\x00' values are missing, if any. Not ideal, but it may be the easiest 
thing to do.
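
A minimal sketch of that padding workaround, assuming the expected length of each stored string is known in advance (for a real frame it would be n_rows * n_cols * 4 bytes). The helper name restore_trailing_nulls and the 8-byte example are made up for illustration.

```python
def restore_trailing_nulls(stored, expected_len):
    """Pad a byte string read back from the file to its original length.

    Only trailing '\\x00' bytes get stripped, so right-padding with
    '\\x00' up to the known length recovers the original contents.
    """
    if len(stored) > expected_len:
        raise ValueError("stored string is longer than expected")
    return stored.ljust(expected_len, b"\x00")


# Hypothetical 8-byte record: float32 1.5 followed by float32 0.0
# (little-endian), read back with the trailing zero bytes missing.
EXPECTED_LEN = 8
truncated = b"\x00\x00\xc0\x3f"
restored = restore_trailing_nulls(truncated, EXPECTED_LEN)
```

Leading zero bytes are untouched by the stripping, which is why padding on the right is enough here.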

I'll also have a look at the blosc package. Thank you for showing me that.

Thanks a lot for your help. I really appreciate it.

Adriano



_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
