>
> Aha, so you are doing a binary search in an 'index' first; then it is
> almost certain that most of the time is spent performing the look-up in
> this rank-1 array. As you are doing a binary search, and the minimum
> unit of I/O in HDF5 is precisely the chunksize, having small
> chunksizes will favor performance. Looking at your finding
> times, my guess is that your 'index' array is on-disk, and the sparse
> access (i.e. the binary search) to it is your bottleneck.
While I think you are generally correct, the search times are somewhat
deceptive, as there is a lot going on besides just finding an offset. I
basically have to initialize finite-element (FE) data objects from the
results of the PyTables searches. In any case, if I understand you
correctly, would making all my index arrays uncompressed Arrays rather
than CArrays be optimal from a performance point of view? If not, is
there a way to determine the optimal chunkshape? FE models use
unstructured grids that are not so trivial to model in HDF5. The
solution I use is to store different element types as separate datasets
within the same group, like so:
/model/geom2/eid/type1 {17400}
/model/geom2/eid/type2 {61/512}
/model/geom2/eid/type3 {1567}
etc.
Associated data arrays have different shapes depending on element
topology. Another thing that slows things down is that the
/model/geom2/eid group has to be walked so that each leaf can be
binary-searched. Maybe not optimal, but it is clean and easy to
understand.
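To make the layout concrete, here is a stripped-down sketch of what I
mean (the file name, element ids and the 1024 chunk length below are
made-up placeholders, not my real data, and I'm assuming the PyTables
2.0 createCArray/iterNodes API):

    import numpy
    import tables

    # Layout sketch only: file name, element ids and the 1024 chunk
    # length are placeholders, not the real model data.
    h5file = tables.openFile('model.h5', mode='w')
    model = h5file.createGroup('/', 'model')
    geom2 = h5file.createGroup(model, 'geom2')
    eid_group = h5file.createGroup(geom2, 'eid')

    # One sorted element-id index per element type, stored as a CArray
    # with an explicit (small) chunkshape, since each probe of a binary
    # search costs at least one chunk of I/O.
    for name, eids in [('type1', numpy.arange(17400)),
                       ('type3', numpy.arange(1567))]:
        ca = h5file.createCArray(eid_group, name, tables.Int64Atom(),
                                 shape=eids.shape, chunkshape=(1024,))
        ca[:] = eids

    def find_in_leaf(leaf, eid):
        """Binary-search one sorted on-disk index for an element id."""
        lo, hi = 0, leaf.nrows
        while lo < hi:
            mid = (lo + hi) // 2
            if leaf[mid] < eid:   # one element read = one chunk of I/O
                lo = mid + 1
            else:
                hi = mid
        if lo < leaf.nrows and leaf[lo] == eid:
            return lo
        return None

    def find_element(h5file, eid):
        """Walk the eid group, searching each leaf until the id is found."""
        for leaf in h5file.iterNodes('/model/geom2/eid', classname='Leaf'):
            pos = find_in_leaf(leaf, eid)
            if pos is not None:
                return leaf.name, pos   # element type and row offset
        return None

    print(find_element(h5file, 1500))   # -> ('type1', 1500) in this toy case
    h5file.close()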
> Unfortunately, you did not send the chunksizes for the rank-1 index
> array, but most probably the chunksize for the 'old' files is rather
> small compared with the 'new' arrays.
Yes. You certainly know your business. When I originally set up 'h5import'
to do my conversions, I just used a CHUNKED-DIMENSION-SIZES parameter of
100, not knowing any better. PyTables 1.3 chunked the entire array. When I
ran ptrepack --upgrade-flavors, the chunksize went down to 1024 and the
performance was again reasonable.
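For the record, the chunk size is easy to check from Python after the
repack (assuming the Leaf.chunkshape attribute of PyTables 2.0; the
path is just my example):

    import tables

    # Inspect the chunk size after repacking with something like:
    #   ptrepack --upgrade-flavors old.h5:/ new.h5:/
    f = tables.openFile('new.h5', mode='r')
    leaf = f.getNode('/model/geom2/eid/type1')
    print(leaf.chunkshape)   # e.g. (1024,) after the repack
    f.close()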
I'm pressing ahead with upgrading to 2.0. I'm seeing significant
improvements, which indicates that this is the right move. Fortunately, I
have only the one database with its flavor set to 'numarray'; otherwise
the upgrade would definitely cause me problems, since a lot of my client
scripts use numarray (or even Numeric) to manipulate arrays pulled from
the HDF5 files.
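In case it helps anyone doing the same migration, checking (and, if I
read the 2.0 docs correctly, switching) the flavor of an existing leaf
looks roughly like this; again the path is just my example:

    import tables

    f = tables.openFile('model.h5', mode='a')
    leaf = f.getNode('/model/geom2/eid/type1')
    print(leaf.flavor)       # 'numarray' before conversion, 'numpy' after
    # leaf.flavor = 'numpy'  # switch what reads return, without repacking
    f.close()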
Bravo for amazing software and astonishing support!
Elias Collas
Stress Methods Group
Gulfstream Aerospace Corp