Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?

Adriano Vilela Barbosa Thu, 24 Mar 2011 16:05:53 -0700


----- Original Message ----
> From: FrancescAlted <fal...@pytables.org>
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Sent: Thu, March 24, 2011 1:24:55 AM
> Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?
> 
> On Wednesday 23 March 2011 19:53:43 Adriano Vilela Barbosa wrote:
> > >From: FrancescAlted <fal...@pytables.org>
> > >To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> > >Sent: Wed, March 23, 2011 10:57:06 AM
> > >Subject: Re: [Pytables-users] Problem writing strings to a CArray. Could this be a bug?
> > >
> > >
> >  >2011/3/23 Adriano Vilela Barbosa <adriano.vil...@yahoo.com>
> >  >
> > >This is not a bug, but rather a feature of NumPy.  Look at this:
> > >>>>> import numpy as np
> > >>>>> a = np.array(['aa\x00\x00'])
> > >>>>> a[0]
> > >>
> > >>'aa'          # hey! where have my trailing 0's gone?
> > >>
> > >>>>> a.data[:]
> > >>
> > >>'aa\x00\x00'  # yeah, they are still in the data area of the array
> > >>
> > >>I'd recommend using the byte ('i1') type to achieve what you want:
> > >>>>> a.view('i1')
> > >>
> > >>array([97, 97,  0,  0], dtype=int8)
> > >>
> > >>Thank you very much for your explanation, but I still don't get it.
> > >>
> > >>Let's forget numpy for a moment and just say I want to store the
> > >>string 'aa\x00\x00' in a CArray. Each element of the CArray is a
> > >>4-element string.
> > >>
> > >>First, I create the CArray:
> > >>>>> import tables
> > >>>>> fid = tables.openFile("carray_test.hdf","w")
> > >>>>>
> > >>>>> fid.createGroup("/", 'table', 'Binary table')
> > >>>>> array_atom = tables.StringAtom(itemsize=4)
> > >>>>> array_shape = (1,)
> > >>>>>
> > >>>>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>
> > >>Now, I store the string 'aa\x00\x00' in the first row (which is the
> > >>only row in this example) of the CArray:
> > >>>>> fid.root.table.bin_table[0] = 'aa\x00\x00'
> > >>
> > >>Now, I do
> > >>
> > >>>>> fid.root.table.bin_table[0].data[:]
> > >>
> > >>'aa'
> > >>
> > >>So, it looks to me that the trailing \x00 elements of the string
> > >>are not being stored in the CArray. From my side, there's no numpy
> > >>involved; I'm just trying to store a string. What am I missing?
> > 
> > You cannot avoid NumPy because PyTables uses NumPy behind the scenes
> > as an intermediate buffer area.  What you are seeing is probably a
> > side effect of the 'feature' I mentioned before.  Any
> > reason why you don't want to use a byte type instead of a string?
> > 
> > >FrancescAlted
> > 
> > Hi again,
> > 
> > I'm happy to use bytes instead of strings. The reason I was using
> > strings is that, as someone new to Python and numpy, I thought
> > strings were the only way of dealing with individual bytes. Also,
> > because of this problem I'm having with strings, I tried storing the
> > numpy arrays directly into the HDF file, but the performance was
> > considerably worse and the file considerably larger.
> > 
> > So, going back to my previous example, I guess the only things I need
> > to change are the Atom object used to construct the CArray and the
> > numpy array method: view() instead of tostring().
> > 
> > >>> import numpy
> > >>> import tables
> > >>> fid = tables.openFile("carray_test.hdf","w")
> > >>> fid.createGroup("/", 'table', 'Binary table')
> > >>> array_atom = tables.Atom.from_dtype(numpy.dtype((numpy.byte, (4,))))
> > >>> array_shape = (1,)
> > >>> fid.createCArray(fid.root.table,'bin_table',array_atom,array_shape)
> > >>> a = numpy.array(['aa\x00\x00'])
> > >>> fid.root.table.bin_table[0] = a.view('b')
> > >>> fid.root.table.bin_table[0].data[:]
> > 
> > 'aa\x00\x00'
> 
> Ah!  I see where the problem was now (the assignment).  Thanks for
> pointing it out.

Yes, the problem occurs when assigning a string (any string, not only one
obtained from a numpy array) to the CArray: the trailing '\x00' items are
simply lost. In the examples you gave before with numpy arrays, you could
still see that the trailing '\x00' elements were there by doing a.data[:];
however, after assigning the string to the CArray, even if I do

fid.root.table.bin_table[0].data[:]

I can't see them. Is this really the way this is supposed to work?
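
For reference, a minimal NumPy-only sketch of the behaviour discussed above. It does not touch PyTables, so it says nothing about what actually ends up in the HDF5 file; it only shows that element access on a fixed-width string array drops trailing NULs while the underlying buffer keeps them. The b'' literals and the explicit 'S4' dtype are assumptions made so the sketch runs on Python 3 (the session quoted above is Python 2).

```python
import numpy as np

# One fixed-width, 4-byte string element. The b"" prefix and the explicit
# dtype are assumptions of this sketch (Python 3), not taken from the thread.
a = np.array([b"aa\x00\x00"], dtype="S4")

element = a[0]               # element access drops the trailing NUL bytes
raw = a.tobytes()            # the underlying buffer still holds all 4 bytes
as_bytes = a.view(np.uint8)  # same buffer reinterpreted as unsigned bytes

print(repr(element))
print(repr(raw))
print(as_bytes.tolist())
```

Viewing the data as a byte type, as suggested earlier in the thread, avoids the string semantics altogether, because every byte is then an ordinary integer element.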

> 
> > Is this right, or is there a more efficient way of doing it?
> 
> Well, I don't fully understand why you are converting to strings before
> saving the data into the CArray, because it should support compression
> and be pretty fast too.  Are you really getting a significant speed-up
> by converting to strings?
> 
> In case you want to continue down the conversion path, I'd also try a
> VLArray where the elements have been previously compressed using the
> blosc package (https://github.com/FrancescAlted/python-blosc).

In my previous, long (sorry about that) email I told you the reason I'm using 
strings: because of OpenCV. However, I converted my OpenCV images (actually, 
optical flow frames) to numpy arrays and I'm trying to store them in a CArray. 
The data can be seen as a (n_rows, n_cols, n_frames) array, where n_rows and 
n_cols are the number of rows and columns in each frame, respectively, and 
n_frames is the number of frames. The optical flow values are represented as 
int16. Initially, I did

array_shape = (n_rows,n_cols,n_frames)
array_atom = tables.Int16Atom()

and that works fine, although it is much slower and results in considerably
larger files (compared to the string approach). Next, I did

array_shape = (n_frames,)
array_atom = tables.Int16Atom((n_rows,n_cols))

in the hope that this would be faster and compress better. However,
when creating the second CArray (I need two of them, for the horizontal and
vertical pixel displacements) I get the following error:

Traceback (innermost last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/dist-packages/tables/file.py", line 781, in createCArray
    chunkshape=chunkshape, byteorder=byteorder)
  File "/usr/lib/python2.6/dist-packages/tables/carray.py", line 220, in __init__
    byteorder, _log)
  File "/usr/lib/python2.6/dist-packages/tables/leaf.py", line 290, in __init__
    super(Leaf, self).__init__(parentNode, name, _log)
  File "/usr/lib/python2.6/dist-packages/tables/node.py", line 296, in __init__
    self._v_objectID = self._g_create()
  File "/usr/lib/python2.6/dist-packages/tables/carray.py", line 230, in _g_create
    return self._g_create_common(self.nrows)
  File "/usr/lib/python2.6/dist-packages/tables/carray.py", line 253, in _g_create_common
    self._v_objectID = self._createCArray(self._v_new_title)
  File "hdf5Extension.pyx", line 877, in tables.hdf5Extension.Array._createCArray
MemoryError


This is the memory error I mentioned before. Any ideas why this happens?
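
To put numbers on the second layout, here is a rough sketch of the arithmetic. It assumes that, with a multidimensional atom, every chunk along the main dimension holds whole frames, so the smallest possible chunk is one full frame. The frame size below is hypothetical, and whether chunk size has anything to do with the MemoryError above is not established; the sketch only shows how quickly a chunk grows with the number of frames it contains.

```python
def frame_nbytes(n_rows, n_cols, itemsize=2):
    """Bytes in one (n_rows, n_cols) frame; itemsize=2 corresponds to int16."""
    return n_rows * n_cols * itemsize


def chunk_nbytes(frames_per_chunk, n_rows, n_cols, itemsize=2):
    """Bytes in a chunk that holds `frames_per_chunk` whole frames."""
    return frames_per_chunk * frame_nbytes(n_rows, n_cols, itemsize)


# Hypothetical frame size, chosen only for illustration.
n_rows, n_cols = 480, 640
for frames_per_chunk in (1, 16, 1024):
    print(frames_per_chunk, chunk_nbytes(frames_per_chunk, n_rows, n_cols))
```

If chunk size does turn out to matter, the chunkshape argument that appears in the traceback would be the natural thing to experiment with.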

> 
> Good luck!
> 
> -- 
> FrancescAlted


Thank you very much,

Adriano


