Hi Jon Olav,

On Thursday 11 June 2009 11:36:35 Jon Olav Vik wrote:
> I have a Pytables 2.0.4 VLArray, called "y", with about 6500 rows of about
> 8500 atoms of shape (36,). The following line takes about 20 minutes to
> run: i = sum(len(yi) for yi in y)

Yes, I think this is expected: HDF5 implements variable-length types using
on-disk pointers, so every row read has to follow a pointer to wherever the
data lives. That can be very slow because of the high latency of mechanical
disks (solid-state disks should help a lot here).
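
To put rough numbers on this, here is a back-of-the-envelope check using the
figures from the message quoted below (about 6500 rows of about 8500 atoms,
36 Float64 values per atom, 1192.63 s wall time). It only restates the
poster's own estimate; nothing here is measured independently.

```python
# Figures taken from the quoted message; the sizes are the poster's round estimates.
rows = 6500
atoms_per_row = 8500
values_per_atom = 36
bytes_per_value = 8       # Float64
wall_seconds = 1192.63    # from the IPython "time" output

total_bytes = rows * atoms_per_row * values_per_atom * bytes_per_value
throughput_mb_s = total_bytes / wall_seconds / 1e6

print(f"data read: {total_bytes / 1e9:.1f} GB")
print(f"effective throughput: {throughput_mb_s:.1f} MB/s")
```

An effective rate in this range is far below what a disk delivers on a purely
sequential read, which is consistent with the time being spent waiting rather
than transferring.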

> Question 1: Can I somehow access the length of a VLArray row without having
> to read the entire row?

Mmm, not at the moment, but that should be possible (and certainly useful).  
I've filed an enhancement ticket about this:

http://pytables.org/trac/ticket/229
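
Until something like that exists, one possible workaround is to record each
row's length in a small side array at the moment the row is appended, and sum
that instead. The sketch below is only an illustration of the idea, not a
PyTables feature: the node name "y_len" and both helper functions are made up
for this example, and it uses the PyTables 3.x method names (2.x spells them
openFile, createVLArray and createEArray).

```python
import os
import tempfile

import numpy as np
import tables  # PyTables 3.x naming assumed


def write_rows(path, rows):
    """Store ragged rows in a VLArray and their lengths in a small side array."""
    with tables.open_file(path, mode="w") as h5:
        y = h5.create_vlarray("/", "y", atom=tables.Float64Atom(shape=(36,)))
        # Hypothetical companion node: one int64 per VLArray row.
        y_len = h5.create_earray("/", "y_len", atom=tables.Int64Atom(), shape=(0,))
        for row in rows:
            y.append(row)              # row has shape (n_atoms, 36)
            y_len.append([len(row)])   # remember n_atoms for this row


def total_atoms(path):
    """Sum the row lengths without touching the variable-length data."""
    with tables.open_file(path, mode="r") as h5:
        return int(h5.root.y_len[:].sum())


# Small demonstration with three short rows.
demo_path = os.path.join(tempfile.mkdtemp(), "lengths_demo.h5")
write_rows(demo_path, [np.zeros((n, 36)) for n in (3, 5, 7)])
print(total_atoms(demo_path))
```

The side array costs 8 bytes per row, so for a few thousand rows it is read
in a single small request. The obvious drawback is that it has to be kept in
sync by whatever code appends to the VLArray.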

>
> Question 2: Further on I only need to work with the last 20% or so of each
> row. Is there an efficient way to slice from a row without having to load
> it all from disk?
>
> for i in range(len(y)):
>     yj = y[i][-2000:] # not having to read y[i][:6500]
>     ...

I'm afraid you can't.  HDF5 variable-length elements cannot be subdivided:
the entire data element must be transferred.  See:

http://www.hdfgroup.org/HDF5/doc/UG/11_Datatypes.html

section 4.3.2.3 for more info on this.
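
If restructuring the file is an option, one alternative layout (an assumption
on my side, not something discussed in the original question) is to drop the
VLArray and concatenate all rows into a single fixed-width EArray of shape
(N, 36), with a separate array of row offsets. An ordinary slice on such an
array is a partial read, so the tail of a row can be fetched on its own. The
node names "y_flat" and "y_offsets" and the helper functions are hypothetical,
and the PyTables 3.x method names are used.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables 3.x naming assumed


def write_flat(path, rows, width=36):
    """Concatenate ragged rows into one (N, width) EArray plus row offsets."""
    with tables.open_file(path, mode="w") as h5:
        flat = h5.create_earray("/", "y_flat", atom=tables.Float64Atom(),
                                shape=(0, width))
        offsets = [0]
        for row in rows:
            flat.append(row)                        # row has shape (n_atoms, width)
            offsets.append(offsets[-1] + len(row))  # running end position
        h5.create_array("/", "y_offsets", obj=np.asarray(offsets, dtype=np.int64))


def read_tail(path, i, n):
    """Return the last n atoms of logical row i (or the whole row if shorter)."""
    with tables.open_file(path, mode="r") as h5:
        offsets = h5.root.y_offsets[:]              # small: one int64 per row, plus one
        start, stop = int(offsets[i]), int(offsets[i + 1])
        return h5.root.y_flat[max(start, stop - n):stop]


# Small demonstration: row k is filled with the value k.
demo_path = os.path.join(tempfile.mkdtemp(), "flat_demo.h5")
write_flat(demo_path, [np.full((n, 36), float(k)) for k, n in enumerate((4, 6, 5))])
print(read_tail(demo_path, 1, 2).shape)
```

With this layout the row lengths also come for free as the differences
between consecutive offsets. Note that HDF5 still reads whole chunks, so how
much I/O a tail slice actually saves depends on the chunk shape.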

Regards,

>
> Thanks in advance for any tips.
>
> Regards,
> Jon Olav
>
>
> Background:
>
> If y were a Numpy array in memory, the summing would be fast, because each
> array object remembers its shape. For the VLArray in the HDF5 file, I
> realize now that I need to read all the data to compute the total number of
> atoms. That's 6500 * 8500 * 36 * 8 = 16 GB (meaning about 13 MB/s for 20
> minutes).
>
> From the timing below (and watching "top" for ages), I see that the
> (len(yi) for yi in y) spent almost all its time _waiting_ for disk access
> (status 'D' = uninterruptible sleep, but the support staff tell me it means
> waiting for disk).
>
> In [17]: time i = sum(len(yi) for yi in y)
> CPU times: user 39.93 s, sys: 16.24 s, total: 56.16 s
> Wall time: 1192.63
>
> In [18]: y
> Out[18]:
> /ap/ph/y (VLArray(6561L,)) 'State vector'
>   atom = Float64Atom(shape=(36L,), dflt=0.0)
>   byteorder = 'little'
>   nrows = 6561
>   flavor = 'numpy'
>
> In [20]: len(y[0])
> Out[20]: 8977
>
> In [23]: ls -l vlarraytest.h5
> -rw-r--r-- 1 jonvi users 17377780785 ...

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users