Re: [Pytables-users] Best way to store sequences of arbitrary objects containing arrays

Ivan Vilata i Balaguer Tue, 20 May 2008 01:38:14 -0700

Anand Patil (on 2008-05-19 at 17:48:24 +0100) wrote::

> I'd like to store a long sequence of python objects with pytables. The  
> only things I know about the objects are:
> 
> - Their memory footprint is dominated by a big numpy array, and
> - The attribute name of the big array for each object is the same;  
> it's x1.big_array, x2.big_array, etc.
> 
> I would rather not require the array to be the same shape for each  
> object.
> 
> I think I'd want to make a group with a single ObjectAtom array and  
> a whole bunch of arrays whose atoms correspond to big_array.dtype. To  
> store an object, I'd destroy all its references to its big_array,  
> pickle it in the ObjectAtom array, and store its big_array in one of  
> the other arrays.
> 
> My questions are:
> - Is this the best way to go?
> - What kind of performance penalty am I incurring by storing each of  
> the big_array attributes in its own pytables array, rather than making  
> them cells in a table? How can I mitigate it?
> - How can I make sure that all of an object's references to its  
> big_array get destroyed, so that the latter doesn't get pickled with  
> the object?


I find your approach quite reasonable.  Creating each data array carries
some overhead (for the node metadata), but that can be offset by the
space you save when using ``CArray`` or ``EArray`` nodes with
compression.  If you end up with more than 4096 nodes, take care not to
place them all in the same group, to avoid performance problems with the
object tree.
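
One possible way to keep every group under that limit is to derive the
group and node names from a running object index.  The sketch below is
only an illustration: ``sharded_data_path``, the ``/arrays`` root and the
naming pattern are all assumptions of mine, not part of PyTables.

```python
import posixpath

# Threshold mentioned above: keep at most this many nodes in one group.
MAX_CHILDREN = 4096


def sharded_data_path(index, root="/arrays", max_children=MAX_CHILDREN):
    """Map a running object index to a (group path, node name) pair.

    Hypothetical scheme: arrays are bucketed into subgroups holding at
    most `max_children` nodes each, e.g. /arrays/g000000/a0000000.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    bucket = index // max_children               # which subgroup
    group_path = posixpath.join(root, "g%06d" % bucket)
    node_name = "a%07d" % index                  # unique across buckets
    return group_path, node_name


if __name__ == "__main__":
    # The last index of the first bucket and the first of the second.
    print(sharded_data_path(4095))
    print(sharded_data_path(4096))
```

With this scheme the root group itself only gets one child per 4096
arrays.  The subgroups still have to exist before arrays are created in
them; as far as I know the node creation calls accept a
``createparents`` flag for that.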

Since PyTables doesn't alter the data you store, you have to keep the
array out of the pickled object yourself.  You could define your own
storer and loader functions that replace the ``big_array`` attribute
with some kind of reference to the array node (e.g. its path), something
like::

  import copy
  import tables

  # Note: PyTables 3.x spells createCArray/getNode as
  # create_carray/get_node.

  def store_object(obj, vlarray):
      array = obj.big_array
      arrpath, arrname = compute_data_path(obj, vlarray, ...)
      # Pickle a shallow copy that holds the array's location, not its data.
      st_obj = copy.copy(obj)
      st_obj.big_array = (arrpath, arrname)
      vlarray.append(st_obj)
      st_obj_pos = len(vlarray) - 1
      # Store the array itself in its own node.
      arr = vlarray._v_file.createCArray(
          arrpath, arrname, tables.Atom.from_dtype(array.dtype), array.shape)
      arr[:] = array
      return st_obj_pos

  def load_object(st_obj_pos, vlarray):
      obj = vlarray[st_obj_pos]
      # Follow the stored reference and read the array back into memory.
      arrpath, arrname = obj.big_array
      arr = vlarray._v_file.getNode(arrpath, arrname)
      obj.big_array = arr[:]
      return obj
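
If you want to convince yourself that the shallow copy really keeps the
array out of the pickle, a check along these lines may help.  It uses
only the standard library: a ``types.SimpleNamespace`` stands in for
your object, a ``bytearray`` for the big array, and the path and node
name are made up for the example.

```python
import copy
import pickle
from types import SimpleNamespace


def detach_for_pickling(obj, arrpath, arrname):
    """Return a shallow copy whose big_array is a (path, name) reference."""
    st_obj = copy.copy(obj)                # new object, attributes shared
    st_obj.big_array = (arrpath, arrname)  # rebinds on the copy only
    return st_obj


# Stand-ins (assumptions): any picklable object with a big_array attribute.
original = SimpleNamespace(label="run-1", big_array=bytearray(1_000_000))
stored = detach_for_pickling(original, "/arrays/g000000", "a0000000")

payload = pickle.dumps(stored)
restored = pickle.loads(payload)

# The pickle carries only the reference; the original keeps its data.
assert len(payload) < 1000
assert restored.big_array == ("/arrays/g000000", "a0000000")
assert len(original.big_array) == 1_000_000
```

Note that this only covers the ``big_array`` attribute itself.  If the
object keeps the array reachable through some other attribute, that
reference would still be pickled and would need the same treatment.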

Hope that helps,

::

  Ivan Vilata i Balaguer   @ Welcome to the European Banana Republic! @
  http://www.selidor.net/  @     http://www.nosoftwarepatents.com/    @
