Re: [Pytables-users] Performance issues when writing a large number of arrays

David Fokkema Sun, 27 Sep 2009 04:46:01 -0700

Hi Abiel,

On Sat, 2009-09-26 at 09:56 -0400, Abiel Reinhart wrote:
> David,
> 
> Thanks for the reply. I have also looked at the VLArray approach, but
> I am looking to build an interactive application and I am a bit
> confused about whether VLArray is compatible with this. For example,
> people can naturally be expected to remove individual numpy arrays
> from the database, but I do not see any method for removing rows in a
> VLArray without rewriting the whole object. Even if there were, it
> would seem that this would necessitate going to the Table that links
> the keys and index positions and updating all the index positions.
> Maybe I'm missing something?


Ah... I don't know. I think you should not remove the array, but rather
set it to zero, so that the index positions of the other arrays stay the
same. However, I don't know whether this will reclaim any storage space.
Once in a while, I guess you would have to 'compact' everything by
rewriting the VLArray and updating _all_ index positions (not nice). In
this sense, maybe PyTables is not for you. While PyTables is very good
at handling lots and lots of data, I guess it is not very good at many
inserts and deletes of objects that don't fit in a table and that you
essentially want to keep an index of. SQLite might serve you better in
this case.
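
To make the zero-then-compact idea a bit more concrete, here is a rough
sketch in plain Python. Everything in it is hypothetical: the class name
KeyedRows is made up, a list stands in for the VLArray and a dict stands
in for the table that links keys to index positions. I have not checked
how a real VLArray behaves when a row is blanked, so take it as an
illustration of the bookkeeping only.

```python
class KeyedRows:
    """Toy stand-in: `rows` plays the VLArray, `index` plays the key table."""

    def __init__(self):
        self.rows = []    # append-only row store
        self.index = {}   # string key -> row position

    def add(self, key, values):
        self.index[key] = len(self.rows)
        self.rows.append(list(values))

    def get(self, key):
        return self.rows[self.index[key]]

    def remove(self, key):
        # Blank the row instead of deleting it, so that the positions
        # of all other rows stay valid.
        pos = self.index.pop(key)
        self.rows[pos] = None

    def compact(self):
        # Rewrite the store without the blanked rows and renumber
        # every remaining key (the "not nice" part).
        live = sorted(self.index.items(), key=lambda item: item[1])
        self.rows = [self.rows[pos] for _, pos in live]
        self.index = {key: new for new, (key, _) in enumerate(live)}


store = KeyedRows()
store.add("a", [1, 2, 3])
store.add("b", [4])
store.add("c", [5, 6])
store.remove("b")   # "c" keeps position 2, position 1 is now a hole
store.compact()     # "c" moves to position 1, the hole is gone
```

The point is that remove() is cheap but leaves holes behind, while
compact() has to touch every key.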

Maybe Francesc can comment on that when he has time or is back from his
trip.
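
For the SQLite route mentioned above, something along these lines might
be a starting point. It is only a sketch under my own assumptions: the
table name, the column names and the put/get/delete helpers are made up,
and it packs the integers with the standard library's array module. With
numpy you would probably store arr.tobytes() and read it back with
np.frombuffer, keeping track of the dtype yourself.

```python
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute("CREATE TABLE arrays (key TEXT PRIMARY KEY, data BLOB NOT NULL)")


def put(key, values):
    # Pack the integers as 8-byte signed values and store them as one blob.
    blob = array("q", values).tobytes()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO arrays (key, data) VALUES (?, ?)",
            (key, blob),
        )


def get(key):
    row = conn.execute(
        "SELECT data FROM arrays WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    values = array("q")
    values.frombytes(row[0])
    return values.tolist()


def delete(key):
    # Removing one array does not disturb any other key.
    with conn:
        conn.execute("DELETE FROM arrays WHERE key = ?", (key,))
```

Here the string key is the primary key itself, so there are no index
positions to keep in sync after a delete.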

> On another note, it usually takes me 20-30 seconds to run your
> example, despite my having a Core 2 Duo with 2 GB of RAM. Perhaps this
> is due to differences in storage technology, as in 5400rpm vs. SSD.

Strange. I added a few lines for timing (I really wasn't precise before,
sorry :-/ ) and ran it again a few times. On the first run all the
libraries have to be loaded into the disk cache, but after that it runs
pretty much CPU-bound, I think. I don't have an SSD, so a fast disk is
not what makes the difference, and the script is not multi-processed, so
the second core of your Core 2 Duo won't help either. What's your clock
speed? If it is close to 1.6 GHz, that might be it. But really, this
thing takes between 10.5 and 12.5 seconds here.
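
If you want to separate the cold first run from the later ones, a small
helper like the following could do it. This is just a sketch: time_runs
and workload are made-up names, and workload is a placeholder that you
would replace with the VLArray append loop.

```python
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def workload():
    # Placeholder for the real work; any callable can go here.
    sum(range(200000))


timings = time_runs(workload)
print("first run: %.3f s" % timings[0])
print("best of the rest: %.3f s" % min(timings[1:]))
```

Comparing the first timing with the best of the remaining ones gives a
rough idea of how much of the run time is cache warm-up.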

David


> 
> On Sat, Sep 26, 2009 at 5:57 AM, David Fokkema <dfokk...@ileos.nl> wrote:
> > Hi Abiel,
> >
> > On Fri, 2009-09-25 at 23:07 -0400, Abiel Reinhart wrote:
> >> I am attempting to store a large number of moderately-sized
> >> variable-length numpy arrays in a PyTables database, where each array
> >> can be referred to by a string key. Looking through the mailing list
> >> archives, it seems that one possible solution to this problem is to
> >> simply create a large number of Array objects.
> >
> > <snip>
> >
> > Another solution is to create a VLArray (variable-length array). Like
> > this:
> > >>> import tables
> > >>> import numpy as np
> > >>> h5f = tables.openFile('test.h5', 'w')
> > >>> h5f.createVLArray('/', 'test', tables.Int32Atom())
> > /test (VLArray(0,)) ''
> >  atom = Int32Atom(shape=(), dflt=0)
> >  byteorder = 'little'
> >  nrows = 0
> >  flavor = 'numpy'
> > >>> for i in range(10000):
> > ...    a1 = np.arange(np.random.randint(1000, 10000))
> > ...    h5f.root.test.append(a1)
> >
> > On my puny Eee PC (Atom 1.6 GHz variable), Linux, Python 2.6, it runs in
> > roughly 7 seconds, while the arrays are recreated throughout the loop
> > and have variable size. So already it is faster than your test. You can
> > reference a particular array with h5f.root.test[idx] where idx can of
> > course be textual, in the sense that h5f.root.test[int(idx)] can be used
> > if idx is '1233'.
> >
> > At least, this was suggested by Francesc when I brought up my own
> > problem on this list.
> >
> > Good luck,
> >
> > David
> >
> >
> > ------------------------------------------------------------------------------
> > Come build with us! The BlackBerry® Developer Conference in SF, CA
> > is the only developer event you need to attend this year. Jumpstart your
> > developing skills, take BlackBerry mobile applications to market and stay
> > ahead of the curve. Join us from November 9-12, 2009. Register now!
> > http://p.sf.net/sfu/devconf
> > _______________________________________________
> > Pytables-users mailing list
> > Pytables-users@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> 

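# Timing script: append 10000 variable-length arrays to an Int32Atom VLArray.
import tables
import numpy as np
import time

t0 = time.time()
np.random.seed(1)  # fixed seed, so every run writes the same array sizes
h5f = tables.openFile('test.h5', 'w')
h5f.createVLArray('/', 'test', tables.Int32Atom())
for i in range(10000):
    # a new array of random length (1000 to 9999 elements) on each pass
    a1 = np.arange(np.random.randint(1000, 10000))
    h5f.root.test.append(a1)
t1 = time.time()  # note: the clock stops before close() is called
h5f.close()
print(t1 - t0)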
