Hi Elias,

On Thursday 23 August 2007, you wrote:

> Francesc,
>
> Here's my setup:
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> PyTables version:  1.3
> HDF5 version:      1.6.5
> numarray version:  1.5.1
> Zlib version:      1.2.1
> BZIP2 version:     1.0.2 (30-Dec-2001)
> Python version:    2.4.3 (#1, Apr 21 2006, 14:31:08)
>                    [GCC 3.3.3 (SuSE Linux)]
> Platform:          linux2-x86_64
> Byte-ordering:     little
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> I recently switched from 'h5import' to PyTables to convert the output
> from large finite element models into HDF5 format. I like the PyTables
> approach because it gives me more control than the shell scripts I
> cobbled together to drive 'h5import'.
>
> However, the most recent file takes much longer to search. Here are
> the results of a simple test I ran against the old and new databases:
>
> 'New':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4      1121910
> fh.find('1121910') took 2.37 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 9.44 sec
>
> 'Old':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4      1121910
> fh.find('1121910') took 0.664 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 0.638 sec
>
> The only difference I could detect between the two files is that the
> PyTables version uses the 'shuffle' filter.
> Here is some ptdump output for some of the nodes:
>
> 'New':
> $ ptdump -v xxx_lev_1_1.h5:/results/oef1/quad4
> /results/oef1/quad4 (EArray(1022L, 17759L, 3L), shuffle, zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17759L, 3L), flavor='numarray')
>   nrows = 1022
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'little'    <- Notice this
>
> 'Old':
> $ ptdump -v xxx_lev_0.h5:/results/oef1/quad4
> /cluster/stress/methods/local/lib/python2.4/site-packages/tables/File.py:227:
> UserWarning: file ``xxx_lev_0.h5`` exists and it is an HDF5 file, but it
> does not have a PyTables format; I will try to do my best to guess
> what's there using HDF5 metadata
>   METADATA_CACHE_SIZE, nodeCacheSize)
> /results/oef1/quad4 (EArray(1018L, 17402L, 3L), zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17402L, 3L), flavor='numarray')
>   nrows = 1018
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'big'    <- Notice this
>
> My client code is completely unchanged in this testing: only the
> databases were created by two different methods. I have yet to do more
> testing with smaller files (these are ~2.2 GB). I read the section on
> shuffling in the manual, where it suggests that shuffle will actually
> improve throughput, but this is the only difference I could detect. It
> is not a trivial matter to produce these large files, so I need to get
> it right. I know it's not much to go on, but any suggestions are
> appreciated.
As I remarked above, another difference is that the 'new' files have been converted to little-endian byte order, and that could affect performance if you process them on a big-endian machine. However, my guess is that the real problem here lies in the shuffle filter.

The thing is that in the PyTables 1.x series, the algorithm for computing the chunk size (i.e. the unit to which compression is applied) was not very well tuned, and the computed size can be as large as 600 KB, which puts too much stress on the shuffle filter. This has been improved in the 2.x series, where the chunk size for files of your size (~2.2 GB) would be something like 32 KB or 64 KB, a much more reasonable figure for shuffling (besides allowing far better performance on sparse reads).

So you may want to try PyTables 2.0 or, if you want to stick with 1.3, try disabling the shuffle filter (at the expense of somewhat less effective compression) when creating the 'new' arrays. My recommendation, though, is to switch to 2.0, as it has further optimizations (such as using NumPy natively) that can improve your times even more.

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
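PS: In case it helps build intuition for why an oversized chunk stresses the shuffle filter, here is a minimal, self-contained sketch of the idea behind it (not the PyTables/HDF5 implementation, just the byte-transposition transform): for elements of N bytes, it groups byte 0 of every element, then byte 1, and so on, so that slowly varying high-order bytes form long runs that zlib compresses better. The whole transposition has to be done per chunk, which is why a 600 KB chunk costs so much more than a 32 KB one.

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Byte-transpose: collect the i-th byte of every element together."""
    n = len(data) // itemsize
    return bytes(data[e * itemsize + b]
                 for b in range(itemsize) for e in range(n))

def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse transform: scatter the grouped bytes back into elements."""
    n = len(data) // itemsize
    out = bytearray(len(data))
    i = 0
    for b in range(itemsize):
        for e in range(n):
            out[e * itemsize + b] = data[i]
            i += 1
    return bytes(out)

# Smoothly varying float32 values: sign/exponent bytes repeat a lot,
# so grouping them tends to help the compressor.
values = struct.pack("<10000f", *(x * 0.001 for x in range(10000)))
plain = len(zlib.compress(values, 6))
shuffled = len(zlib.compress(shuffle(values, 4), 6))
print("plain:", plain, "shuffled:", shuffled)

# The transform is lossless:
assert unshuffle(shuffle(values, 4), 4) == values
```

In real code you would just switch the filter off when creating the arrays, e.g. by passing shuffle=False to tables.Filters() and handing that to the array-creation call.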