Ah, you are ending up with a table that is larger than 6 GB, which is
probably more than the memory in your laptop, so the query has to traverse
the whole table *on disk*, and this is why it is slow. I suppose I was
assuming that you were going to use indexing, but I forgot that this is
only available in the Pro version.

I'd suggest using the VLArray approach then. I'd define an atom that
contains one molecule and fill each row of the VLArray with one iteration.
Retrieving an iteration is then as easy as an indexing operation
(your_vlarray[n_iter]). I don't know how well HDF5 will handle such large
rows, but it is worth a try.
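A minimal, untested sketch of what I mean (the 405x3 molecule shape and the
Python 2 / PyTables 2.x API follow your test script; the file name, node
name and loop bounds are made up for illustration):

    import numpy
    import tables

    # One VLArray row per iteration; each row holds a variable number
    # of 405x3 float32 molecules (the atom shape is fixed, the number
    # of molecules per row is not).
    hf5 = tables.openFile('coords.h5', 'w', title='coords per iteration')
    vla = hf5.createVLArray(hf5.root, 'coords',
                            tables.Float32Atom(shape=(405, 3)),
                            'molecule coordinates, one row per iteration')

    for n_iter in xrange(10):
        n_mols = 20  # anywhere between 20 and 5000 in your case
        mols = numpy.random.normal(size=(n_mols, 405, 3))
        vla.append(mols)  # one iteration -> one variable-length row

    # Retrieving iteration 3 is a plain indexing operation; the result
    # is an array of shape (n_mols, 405, 3).
    iteration = vla[3]
    hf5.close()

With 5000 molecules per iteration a single row would be about 23 MB
(5000 * 405 * 3 * 4 bytes), so a quick test like this should also tell you
whether HDF5 copes with rows that big.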
Hope that helps,

On Wednesday 02 June 2010 17:41:34, Joshua Adelman wrote:
> Hi Francesc,
>
> Thank you for your reply. I've done some testing of building a pytables
> file with a single table, set up as follows:
>
>     import numpy
>     import tables
>
>     class Seg(tables.IsDescription):
>         n_iter = tables.UInt32Col(pos=0)
>         seg_id = tables.UInt32Col(pos=1)
>         x = tables.Float32Col(shape=(405, 3), pos=2)
>
>     hf5 = tables.openFile('testfile.h5', 'w', title='test file')
>     table = hf5.createTable(hf5.root, 'data', Seg, 'data')
>     seg = table.row
>     x = numpy.random.normal(size=(405, 3))
>     itn = 0
>     for k in xrange(int(1.3e6)):
>         if (k % 5000 == 0):
>             itn += 1
>             print k
>         seg['n_iter'] = itn
>         seg['seg_id'] = k
>         seg['x'] = x
>         seg.append()
>
>     hf5.close()
>
> This gives a single table with 1.3 million rows; the row size is about
> 5 KB. I'm concerned, though, that finding all rows (in particular the
> seg_id) corresponding to a particular n_iter value seems to be very slow
> (it takes a couple of minutes on my 2.8 GHz Intel Core Duo MacBook Pro):
>
>     s = [y['seg_id'] for y in h.root.data.where('n_iter == 200')]
>
> Do you have any suggestions on re-organizing things to get better
> performance? Is the above code the correct interpretation of what you
> suggested in your original reply? I've only just started using pytables,
> so perhaps I'm missing something important, since what I do above seems
> similar to the in-kernel example in the manual, yet I'm getting much
> poorer performance.
>
> Thanks,
> Josh
>
> On Jun 2, 2010, at 6:17 AM, Francesc Alted wrote:
> > On Wednesday 02 June 2010 05:22:15, Joshua Adelman wrote:
> >> Looking through the archives there seem to be a number of suggestions
> >> on organizing data, but all of the ones I could find are for data sets
> >> that are more complicated than I need.
> >>
> >> I just need to store the 3D coordinates (~500x3 numpy array) of a
> >> number of instances of a system for each iteration. Basically I have a
> >> variable number of 'molecules' per iteration, made up of about 500
> >> particles (this shape is fixed). Each iteration has 20-5000 such
> >> molecules. In the end I imagine that I will need to store on the order
> >> of 10-200 million such 500x3 arrays, grouped by iteration and labeled
> >> by an integer identification number. Since I'm planning on using
> >> pytables in tandem with a sqlalchemy-based sqlite database, I don't
> >> need to store any other information in the pytable besides the
> >> coordinates and the iteration and id labels.
> >>
> >> After writing the data I would need easy access to the coordinate
> >> array using the iteration and molecule index id, and the ability to
> >> say I want all of the coordinates stored for iteration N. Would it
> >> make sense to make a table with columns for iteration, id and
> >> coordinates, or would it be better to use an EArray or VLArray
> >> (although the data isn't jagged, so the latter may not be a good
> >> choice)? My understanding from reading posts on the list is that it
> >> would be highly inefficient to save each 500x3 numpy array as its
> >> own node.
> >>
> >> Any suggestions on a more optimal way of organizing this sort of data
> >> set would be most appreciated.
> >
> > Well, I'd say that 500x3 would be around 12 KB per molecule (assuming
> > that the 3D coordinates are 8-byte elements). This is not too much,
> > and pytables will choose a chunksize that is able to contain a handful
> > of rows (molecules) per chunk, which is fine. Then you can use the
> > selection capabilities of tables to easily select the interesting
> > iterations. So I'd go this route.
> >
> > For the record, the problem comes when your row size exceeds the
> > chunksize of the dataset by far. Typically, pytables chunksizes are
> > between 32 KB and 256 KB. When your row size exceeds these by 10x or
> > more, you may run into performance problems. But if not, you can
> > safely use fixed-size row objects (like Table, EArray or CArray) in
> > pytables.
> >
> > Hope this helps,
> >
> > -- Francesc Alted
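PS: if you want to double-check the chunksize reasoning quoted above
against your test table, pytables exposes the chunk shape it chose for a
table. A small sketch (the file and node names are the ones from your
script; chunkshape and rowsize are the attributes to look at):

    import tables

    hf5 = tables.openFile('testfile.h5', 'r')
    table = hf5.root.data
    # chunkshape is expressed in table rows; rowsize is in bytes
    nrows_per_chunk = table.chunkshape[0]
    print "chunkshape:", table.chunkshape
    print "chunk size: %d KB" % (nrows_per_chunk * table.rowsize // 1024)
    hf5.close()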