Hi Jon Olav,

On Thursday 11 June 2009 11:36:35 Jon Olav Vik wrote:
> I have a Pytables 2.0.4 VLArray, called "y", with about 6500 rows of about
> 8500 atoms of shape (36,). The following line takes about 20 minutes to
> run:
>
> i = sum(len(yi) for yi in y)

Yeah. I suppose that this is expected, because HDF5 implements variable-length
types by using pointers on disk, and this can be very slow due to the high
latency of mechanical disks (solid-state disks should help a lot here).

> Question 1: Can I somehow access the length of a VLArray row without having
> to read the entire row?

Mmm, not at the moment, but that should be possible (and certainly useful).
I've filed an enhancement ticket about this:

http://pytables.org/trac/ticket/229

> Question 2: Further on I only need to work with the last 20% or so of each
> row. Is there an efficient way to slice from a row without having to load
> it all from disk?
>
> for i in range(len(y)):
>     yj = y[i][-2000:] # not having to read y[i][:6500]
>     ...

I'm afraid you can't. The thing is that VL types cannot be divided, so the
entire data element must be transferred. See:

http://www.hdfgroup.org/HDF5/doc/UG/11_Datatypes.html

section 4.3.2.3 for more info on this.

Regards,

> Thanks in advance for any tips.
>
> Regards,
> Jon Olav
>
>
> Background:
>
> If y were a Numpy array in memory, the summing would be fast, because each
> array object remembers its shape. For the VLArray in the HDF5 file, I
> realize now that I need to read all the data to compute the total number of
> atoms. That's 6500 * 8500 * 36 * 8 = 16 GB (meaning about 13 MB/s for 20
> minutes).
>
> From the timing below (and watching "top" for ages), I see that the
> (len(yi) for yi in y) spent almost all its time _waiting_ for disk access
> (status 'D' = uninterruptible sleep, but the support staff tell me it means
> waiting for disk).
>
> In [17]: time i = sum(len(yi) for yi in y)
> CPU times: user 39.93 s, sys: 16.24 s, total: 56.16 s
> Wall time: 1192.63
>
> In [18]: y
> Out[18]:
> /ap/ph/y (VLArray(6561L,)) 'State vector'
>   atom = Float64Atom(shape=(36L,), dflt=0.0)
>   byteorder = 'little'
>   nrows = 6561
>   flavor = 'numpy'
>
> In [20]: len(y[0])
> Out[20]: 8977
>
> In [23]: ls -l vlarraytest.h5
> -rw-r--r-- 1 jonvi users 17377780785 ...

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
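
As a quick sanity check on the figures in the quoted background, the data volume and effective read rate can be recomputed from the numbers in the message. This is only a back-of-the-envelope sketch: the helper names (total_bytes, rate_mb_per_s) are made up for illustration, the row and atom counts are the approximate ones quoted, and 1 MB is taken as 10**6 bytes.

```python
# Rough check of the estimate "6500 * 8500 * 36 * 8 = 16 GB at about 13 MB/s".
# All inputs are the approximate figures quoted in the message above.

ROWS = 6500                 # approximate number of VLArray rows
ATOMS_PER_ROW = 8500        # approximate number of atoms per row
ATOM_LENGTH = 36            # each atom has shape (36,)
BYTES_PER_FLOAT64 = 8       # Float64Atom -> 8 bytes per value
WALL_TIME_S = 1192.63       # wall time reported by IPython's "time"
FILE_SIZE_BYTES = 17377780785  # size of vlarraytest.h5 from "ls -l"


def total_bytes(rows, atoms_per_row, atom_length, itemsize):
    """Bytes that must be read if every row is loaded in full."""
    return rows * atoms_per_row * atom_length * itemsize


def rate_mb_per_s(nbytes, seconds):
    """Average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / float(seconds) / 1e6


estimated = total_bytes(ROWS, ATOMS_PER_ROW, ATOM_LENGTH, BYTES_PER_FLOAT64)
estimated_rate = rate_mb_per_s(estimated, WALL_TIME_S)
file_rate = rate_mb_per_s(FILE_SIZE_BYTES, WALL_TIME_S)

print("estimated data volume: %.1f GB" % (estimated / 1e9))
print("rate from estimate:    %.1f MB/s" % estimated_rate)
print("rate from file size:   %.1f MB/s" % file_rate)
```

Both rates come out well below what a sequential read from a mechanical disk normally delivers, which fits the explanation that the time goes into seek latency while following the on-disk pointers rather than into raw transfer.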