On Tuesday 28 August 2007, [EMAIL PROTECTED] wrote:
> > Aha, so you are doing a binary search in an 'index' first; then it
> > is almost sure that most of the time is spent in performing the
> > look-up in this rank-1 array.  As you are doing binary search, and
> > the minimum amount of I/O chunk in HDF5 is precisely the chunksize,
> > having small chunksizes will favor the performance.  By looking at
> > your finding times, my guess is that your 'index' array is on-disk,
> > and the sparse access (i.e. the binary search) to it is your
> > bottleneck.
>
> While I think you are generally correct, the search times are
> somewhat deceptive, as there is a lot going on besides just finding
> an offset. I basically have to initialize finite-element (FE) data
> objects from the results of the pytables searches. In any case, if I
> understand you correctly, to make all my index arrays uncompressed
> Arrays rather than CArrays would be optimal from a performance
> point-of-view?

I don't know, but if you are reading the index arrays a lot, then 
avoiding compression completely would be a good move.  Array objects 
are indeed the simplest, and hence the fastest, objects to read.
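For instance, here is a minimal sketch contrasting the two (the file and node names are made up; I use the current snake_case PyTables names, where the 2.0-era spellings were openFile/createArray/createCArray):

```python
import numpy as np
import tables

index = np.arange(1_000_000, dtype=np.int64)  # a toy sorted index

with tables.open_file("index_demo.h5", mode="w") as f:
    # Contiguous, uncompressed Array: no chunking, cheapest to read.
    f.create_array("/", "idx_plain", index)
    # Chunked, zlib-compressed CArray: smaller on disk, but every read
    # has to decompress at least one whole chunk.
    filters = tables.Filters(complevel=5, complib="zlib")
    f.create_carray("/", "idx_packed", obj=index, filters=filters)
```

The Array ends up as a 'contiguous' HDF5 dataset with no chunkshape at all, while the CArray pays the chunk/decompression cost on every sparse access.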

> If not, then is there a way to determine optimal chunkshape?
> FE uses unstructured grids that are not so trivial to 
> model in HDF5. The solution I use is to store different elements as
> separate datasets within the same group like so:
>
> /model/geom2/eid/type1 {17400}
> /model/geom2/eid/type2 {61/512}
> /model/geom2/eid/type3 {1567}
> etc.
>
> Associated data arrays have different shapes depending on element
> topology. Another thing that slows down is that the /model/geom2/eid
> group has to be walked to binary search each leaf. Maybe not optimal,
> but it is clean and easy to understand.

Determining the optimal chunksize depends largely on your problem.  As I 
said before, Array objects are good candidates to try with (they 
are 'contiguous' datasets, in HDF5 parlance, instead of 'chunked' ones, 
so you don't have to worry about setting chunksizes), and if they are 
small enough and you have to read them frequently enough, then they 
will be placed in the HDF5 internal cache.  This should boost your 
searches considerably, I guess.  Another possibility is to load your 
index matrices into NumPy/numarray arrays and do the lookups entirely 
in memory.  This should certainly be the fastest approach.
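A sketch of that in-memory lookup with NumPy (the array contents here are made up; in practice they would come from something like node.read()):

```python
import numpy as np

# Stand-in for e.g. f.root.model.geom2.eid.type1.read()
eids = np.array([3, 7, 7, 15, 42, 99], dtype=np.int64)  # sorted element ids

def find_eid(sorted_ids, target):
    """Binary search in memory; returns the row offset or -1 if absent."""
    pos = np.searchsorted(sorted_ids, target)
    if pos < len(sorted_ids) and sorted_ids[pos] == target:
        return int(pos)
    return -1

assert find_eid(eids, 42) == 4   # found at offset 4
assert find_eid(eids, 5) == -1   # not present
```

Once the index is in memory, each lookup is O(log n) with no I/O at all, which sidesteps the chunk-size question entirely.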

> > Unfortunately, you are not sending the chunksizes for the 1-rank
> > index array, but most probably the chunksize for 'old' files must
> > be rather small compared with the 'new' arrays.
>
> Yes. You certainly know your business. When I originally set up
> 'h5import' to do my conversions, I just used a
> CHUNKED-DIMENSION-SIZES parameter of 100, not knowing any better.
> PyTables 1.3 chunked the entire array. When I did a ptrepack
> --upgrade-flavors, the chunks went down to 1024 and the performance
> was again reasonable.

Aha, that explains the performance pattern you were reporting.
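Just to make the effect visible, here is a sketch (file name and sizes are made up) that forces a tiny 'h5import'-style chunk next to the one PyTables computes automatically:

```python
import numpy as np
import tables

data = np.zeros(100_000, dtype=np.float64)

with tables.open_file("chunk_demo.h5", mode="w") as f:
    # Let PyTables compute a sensible chunkshape (the default behaviour).
    f.create_carray("/", "auto_chunks", obj=data)
    # Force a tiny chunk of 100 elements, as a naive import might.
    f.create_carray("/", "tiny_chunks", obj=data, chunkshape=(100,))

with tables.open_file("chunk_demo.h5") as f:
    print(f.root.auto_chunks.chunkshape, f.root.tiny_chunks.chunkshape)
```

With 100-element chunks, a binary search over the whole dataset touches many more chunks (and HDF5 B-tree entries) than with the automatically chosen, much larger chunkshape.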

> I'm pressing ahead with upgrading to 2.0. I'm seeing significant
> improvements indicating that this is the right move. Fortunately, I
> only have the one database with the flavors set to 'numarray', which
> would definitely cause me problems since a lot of my client scripts
> use numarray (or even Numeric) to manipulate arrays pulled from the
> HDF5 files.

You should know that PyTables 2.0 does support numarray/Numeric right 
out-of-the-box.  The only thing to keep in mind is that, although NumPy 
is used internally, by using the appropriate flavors, you can continue 
obtaining numarray/Numeric objects out of PyTables 2.0.  So, if you 
want to continue using numarray (which I do not recommend, at least for 
the long run), just don't pass the '--upgrade-flavors' flag 
to 'ptrepack'.
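A sketch of how the flavor attribute works (numarray itself is long unmaintained, so the 'python' flavor stands in here for 'numarray'; file and node names are made up):

```python
import tables

with tables.open_file("flavor_demo.h5", mode="w") as f:
    arr = f.create_array("/", "a", [1, 2, 3])
    # The flavor is stored with the node and decides what read() returns;
    # a 2.0-era file kept by skipping --upgrade-flavors would carry
    # 'numarray' here instead.
    arr.flavor = "python"

with tables.open_file("flavor_demo.h5") as f:
    data = f.root.a.read()
    # NumPy is still used internally, but the caller gets a plain list.
    assert isinstance(data, list)
```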

> Bravo for amazing software and astonishing support!

Thanks! :)

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
