Hi Francesc,

Thank you for your reply. I've done some testing, building a pytables file
with a single table set up as follows:
import numpy
import tables

# Row layout: two integer labels plus one 405x3 float32 coordinate array.
class Seg(tables.IsDescription):
    n_iter = tables.UInt32Col(pos=0)
    seg_id = tables.UInt32Col(pos=1)
    x = tables.Float32Col(shape=(405,3),pos=2)

hf5 = tables.openFile('testfile.h5','w',title='test file')
table = hf5.createTable(hf5.root,'data',Seg,'data')
seg = table.row
x = numpy.random.normal(size=(405,3))
itn = 0
# Write 1.3 million rows, advancing the iteration label every 5000 rows.
for k in xrange(int(1.3e6)):
    if (k % 5000 == 0):
        itn += 1
        print k

    seg['n_iter'] = itn
    seg['seg_id'] = k
    seg['x'] = x
    seg.append()

hf5.close()

This gives a single table with 1.3 million rows. The row size is about 5 KB.
I'm concerned, though, that finding all rows (in particular the seg_id values)
corresponding to a particular n_iter value seems to be very slow. The following
takes a couple of minutes on my 2.8 GHz Intel Core Duo MacBook Pro:
s = [y['seg_id'] for y in h.root.data.where('n_iter == 200')]
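
As a rough check on these numbers, here is a minimal sketch that rebuilds the
row layout as a plain numpy dtype (the name seg_dtype is only for this
illustration, it is not part of the file) and works out the sizes involved.
It gives 4868 bytes per row. The whole-table figure only matters if an
unindexed query ends up reading every row from disk, which is an assumption
on my part and not something I have measured.

```python
import numpy

# Same layout as the Seg description: two uint32 labels plus a
# (405, 3) float32 coordinate array per row (packed, no padding).
seg_dtype = numpy.dtype([
    ('n_iter', numpy.uint32),
    ('seg_id', numpy.uint32),
    ('x', numpy.float32, (405, 3)),
])

n_rows = int(1.3e6)
row_bytes = seg_dtype.itemsize        # bytes in one row
table_bytes = n_rows * row_bytes      # uncompressed size of all rows
label_bytes = n_rows * 2 * 4          # the n_iter and seg_id values alone

print("bytes per row: %d" % row_bytes)
print("whole table:   %.2f GB" % (table_bytes / 1e9))
print("labels only:   %.1f MB" % (label_bytes / 1e6))
```

If that assumption holds, the query touches several GB of coordinate data in
order to compare roughly 10 MB of labels, which would be consistent with the
timing I see.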

Do you have any suggestions on re-organizing things to get better performance? 
Is the above code the correct interpretation of what you suggested in your 
original reply? I've only just started using pytables, so perhaps I'm missing 
something important, since what I do above seems similar to the in-kernel 
example in the manual, yet I'm getting much poorer performance.

Thanks,
Josh



On Jun 2, 2010, at 6:17 AM, Francesc Alted wrote:

> On Wednesday 02 June 2010 05:22:15 Joshua Adelman wrote:
>> Looking through the archives there seems to be a number of suggestions on
>> organizing data, but all of the ones I could find are for data sets that
>> are more complicated than I need.
>> 
>> I just need to store the 3D coordinates (~ 500x3 numpy array) of a number
>> of instances of a system for each iteration. Basically I have a variable
>> number of 'molecules' per iteration, made up of about 500 particles (this
>> shape is fixed). Each iteration has 20-5000 such molecules. In the end I
>> imagine that I will need to store on the order of 10-200 million such
>> 500x3 arrays, grouped by iteration and labeled by an integer
>> identification number. Since I'm planning on using the pytables in tandem
>> with a sqlalchemy based sqlite database, I don't need to store any other
>> information in the pytable, besides the coordinates and the iteration and
>> id labels.
>> 
>> After writing the data I would need easy access to the coordinate array
>> using the iteration and molecule index id, and the ability to say I want
>> all of the coordinates stored for iteration N. Would it make sense to make
>> a table with columns for iterations, id and coordinates, or would it be
>> better to use an EArray or VLArray (although the data isn't jagged so the
>> latter may not be a good choice)? My understanding from reading posts on
>> the list is that it would be highly inefficient to save each 500x3 numpy
>> array as its own node
>> 
>> Any suggestions on a more optimal way of organizing this sort of data set
>> would be most appreciated.
> 
> Well, I'd say that 500x3 would be around 12 KB per molecule (assuming that
> the 3D coordinates are 8-byte elements).  This is not too much, and pytables
> will choose a chunksize that is able to contain a handful of rows (molecules)
> per chunk, which is fine.  Then, you can use the selection capabilities of
> tables to easily select the interesting iterations.  So I'd go this route.
> 
> For the record, the problem is when you have row sizes that exceed by far
> the chunksize of a dataset.  Typically, pytables chunksizes are between 32 KB
> and 256 KB.  When you have row sizes that exceed 10x these sizes, then you
> may have performance problems.  But if not, you can safely use fixed-size row
> objects (like Table, EArray or CArray) in pytables.
> 
> Hope this helps,
> 
> -- 
> Francesc Alted
> 
> ------------------------------------------------------------------------------
> 
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users


