On Monday 05 May 2008, Glenn wrote:
> Francesc Alted <falted <at> pytables.org> writes:
> > > > Representing a 1D column is as easy as passing a 'shape=(N,)'
> > > > argument to your 1D columns. Look at this example:
> > > >
> > > > import numpy, tables
> > > >
> > > > N = 10   # your 1D array length
> > > > class TTable(tables.IsDescription):
> > > >     col1 = tables.Int32Col(pos=0)
> > > >     col2 = tables.Float64Col(shape=(N,), pos=1)  # your 1D column
> > > > f = tables.openFile("test.h5", "w")
> > > > t = f.createTable(f.root, 'table', TTable, 'table test')
> > > > for i in xrange(10):
> > > >     t.append([[i, numpy.random.rand(N)]])
> > > > t.flush()
> > > > f.close()
> > > >
> > > > Hope that helps,
> > >
> > > Thank you for the help; I got it working with a Table now.
> > > I have a couple of new questions.
> > > My table has a column holding a 1000-element 1D numpy array. I
> > > would like to do the following types of operations, where I treat
> > > this column as an N x 1000 2D array, call it X:
> > >
> > > mean(X, axis=0)
> > >
> > > std(X[k].reshape((k, N/k)))
> > >
> > > In the mean case, I could imagine doing something like:
> > >
> > > m = zeros((1, 1000))
> > > for row in X:
> > >     m = m + row
> > > m / N
> > >
> > > But it seems like this will be slow. I tried just numpy.mean(X)
> > > out of curiosity, but it took forever and finally ran out of
> > > memory. I assume it was forming a copy of the array in memory.
> >
> > Can you be a bit more explicit about how you are building X? A
> > self-contained code example, with timings, is always nice to have.
> >
> > Cheers,
>
> I am building X just as you suggested.
>
> Setting up:
>
> desc = {'AccNumber': tables.Int32Col(),
>         'dataI': tables.Float32Col(shape=(512,)),
>         'dataQ': tables.Float32Col(shape=(512,))}
> self.table = self.fileh.createTable(self.fileh.root,
>                                     'SpectrometerTimeSeries', desc, '')
> self.fileh.setNodeAttr(self.fileh.root, 'StartTime', time.asctime())
>
> Writing data each iteration:
>
> self.table.row['AccNumber'] = data['lastAccNum']
> self.table.row['dataI'] = dataA
> self.table.row['dataQ'] = dataB
> self.table.row.append()
>
> Periodically flushing the data:
>
> if now - self.LastUpdateTime > self.UpdatePeriod:
>     self.table.flush()
>
> Writing the data is indeed very fast.
>
> I just tried timing the following:
>
> table = fh.root.SpectrometerTimeSeries
>
> def test():
>     tic = time.time()
>     m = np.zeros(512)
>     for x in table.iterrows():
>         m += x['dataI']
>     print time.time() - tic
>
> and it took 82 seconds for my table with 1.35 million rows, so that
> works out to ~33 MB per second, which is not too bad. I guess my
> dataset was just larger than I had realized. I would still appreciate
> any comments on the above code, and whether I am doing things
> correctly.
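A chunked variant of the reduction above, as a sketch (not from the
original thread): it relies on Table.read()'s 'field' argument, which
returns a slice of a single column as a NumPy array, so each iteration
pulls a block of rows instead of one row at a time and avoids most of
the Python-level loop overhead. The file name 'test.h5' is made up.

import numpy as np
import tables

fh = tables.openFile('test.h5', 'r')   # hypothetical file name
table = fh.root.SpectrometerTimeSeries

def chunked_mean(table, field='dataI', chunk=10000):
    # Accumulate the column sum block by block to bound memory use.
    m = np.zeros(512, dtype=np.float64)
    nrows = table.nrows
    for start in xrange(0, nrows, chunk):
        block = table.read(start, min(start + chunk, nrows), field=field)
        m += block.sum(axis=0)
    return m / nrows

print chunked_mean(table)
fh.close()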
I see. First of all, since tables in PyTables are currently stored
row-wise, you have to read complete rows whenever you iterate over the
table. Each of your rows is about 4 KB (4 bytes for 'AccNumber' plus
two 512-element float32 arrays), so you are actually reading at
~4 KB * 1.35e6 rows / 82 s =~ 66 MB/s, which is pretty good for a
modern single hard disk.

If you want more speed, you have two options (sketched in the example
below):

- Use LZO to compress your data (see the 'filters=' parameter of
  createTable()). If your data is compressible enough, you should be
  able to double your I/O throughput, provided that you are using a
  relatively modern CPU. See chapter 5 of the User's Guide for more
  info about the speed-ups you can achieve with compression.

- Save the columns in EArrays. With this, you only have to read one
  column's data from disk, halving the time required.

Of course, you can combine both approaches for optimal results.

Regards,

--
Francesc Alted
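A minimal sketch of both options, using the same PyTables 2.x API as
the rest of the thread. The file name is made up, and the LZO line
assumes the optional LZO binding is installed ('zlib' ships with
PyTables and works as a drop-in fallback):

import numpy as np
import tables

# Same description Glenn used for the table.
desc = {'AccNumber': tables.Int32Col(),
        'dataI': tables.Float32Col(shape=(512,)),
        'dataQ': tables.Float32Col(shape=(512,))}

# Option 1: compress the table via the 'filters=' parameter.
filters = tables.Filters(complevel=1, complib='lzo')
fh = tables.openFile('compressed.h5', 'w')   # hypothetical file name
table = fh.createTable(fh.root, 'SpectrometerTimeSeries', desc, '',
                       filters=filters)

# Option 2: store each column as its own EArray, so a scan over
# 'dataI' never has to read the bytes of 'dataQ' from disk.
dataI = fh.createEArray(fh.root, 'dataI', tables.Float32Atom(),
                        shape=(0, 512), filters=filters)
dataI.append(np.random.rand(1000, 512).astype(np.float32))

# The column-wise mean now touches only this array:
m = np.zeros(512)
for start in xrange(0, dataI.nrows, 100):
    m += dataI.read(start, start + 100).sum(axis=0)
print m / dataI.nrows

fh.close()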