Hi Elias,

On Thursday 23 August 2007, you wrote:

> Francesc,
>
> Here's my setup:
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> PyTables version:  1.3
> HDF5 version:      1.6.5
> numarray version:  1.5.1
> Zlib version:      1.2.1
> BZIP2 version:     1.0.2 (30-Dec-2001)
> Python version:    2.4.3 (#1, Apr 21 2006, 14:31:08)
>                    [GCC 3.3.3 (SuSE Linux)]
> Platform:          linux2-x86_64
> Byte-ordering:     little
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> I recently switched from 'h5import' to PyTables to convert the output
> from large finite element models into HDF5 format. I like the PyTables
> approach because it gives me more control than the shell scripts I
> cobbled together to drive 'h5import'.
>
> However, the most recent file takes much longer to search. Here are
> the results of a simple test I ran against the old and new databases:
>
> 'New':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4      1121910
> fh.find('1121910') took 2.37 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 9.44 sec
>
> 'Old':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4      1121910
> fh.find('1121910') took 0.664 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 0.638 sec
>
> The only difference I could detect between the two files is that the
> PyTables version uses the 'shuffle' filter.
> Here is some ptdump output for some of the nodes:
>
> 'New':
> $ ptdump -v xxx_lev_1_1.h5:/results/oef1/quad4
> /results/oef1/quad4 (EArray(1022L, 17759L, 3L), shuffle, zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17759L, 3L), flavor='numarray')
>   nrows = 1022
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'little'    <- Notice this
>
> 'Old':
> $ ptdump -v xxx_lev_0.h5:/results/oef1/quad4
> /cluster/stress/methods/local/lib/python2.4/site-packages/tables/File.py:227:
> UserWarning: file ``xxx_lev_0.h5`` exists and it is an HDF5 file, but it
> does not have a PyTables format; I will try to do my best to guess
> what's there using HDF5 metadata
>   METADATA_CACHE_SIZE, nodeCacheSize)
> /results/oef1/quad4 (EArray(1018L, 17402L, 3L), zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17402L, 3L), flavor='numarray')
>   nrows = 1018
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'big'    <- Notice this
>
> My client code is completely unchanged in this testing: only the
> databases were created by two different methods. I have yet to do more
> testing with smaller files (these are ~2.2 GB). I read the section on
> shuffling in the manual, where it suggests that shuffle will actually
> improve throughput, but this is the only difference I could detect. It
> is not a trivial matter to produce these large files, so I need to get
> it right. I know it's not much to go on, but any suggestions are
> appreciated.
As I remarked above, another difference is that the 'new' files have been converted to little-endian byte order, and that could affect performance if you process them on a big-endian machine. However, my guess is that the real problem here lies in the shuffle filter.

The thing is that in the PyTables 1.x series, the algorithm for computing the chunk size (i.e. the unit to which compression is applied) was not very well tuned, and the computed size can be as large as 600 KB, which puts too much stress on the shuffle filter. This has been improved in the 2.x series, where the chunk size for files of your size (~2.2 GB) would be something like 32 KB or 64 KB, a much more reasonable figure for shuffling (besides allowing far better performance on sparse reads).

So you may want to try PyTables 2.0 or, if you want to stick with 1.3, try disabling the shuffle filter (at the expense of somewhat less effective compression) when creating the 'new' arrays. My recommendation, though, is to switch to 2.0, as it has further optimizations (such as using NumPy natively) that can improve your times even more.

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
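PS: In case it helps build intuition for why an oversized chunk stresses the shuffle filter, here is a minimal, self-contained sketch of the idea behind it (not the PyTables/HDF5 implementation, just the byte-transposition transform): for elements of N bytes, it groups byte 0 of every element, then byte 1, and so on, so that slowly varying high-order bytes form long runs that zlib compresses better. The whole transposition has to be done per chunk, which is why a 600 KB chunk costs so much more than a 32 KB one.

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Byte-transpose: collect the i-th byte of every element together."""
    n = len(data) // itemsize
    return bytes(data[e * itemsize + b]
                 for b in range(itemsize) for e in range(n))

def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse transform: scatter the grouped bytes back into elements."""
    n = len(data) // itemsize
    out = bytearray(len(data))
    i = 0
    for b in range(itemsize):
        for e in range(n):
            out[e * itemsize + b] = data[i]
            i += 1
    return bytes(out)

# Smoothly varying float32 values: sign/exponent bytes repeat a lot,
# so grouping them tends to help the compressor.
values = struct.pack("<10000f", *(x * 0.001 for x in range(10000)))
plain = len(zlib.compress(values, 6))
shuffled = len(zlib.compress(shuffle(values, 4), 6))
print("plain:", plain, "shuffled:", shuffled)

# The transform is lossless:
assert unshuffle(shuffle(values, 4), 4) == values
```

In real code you would just switch the filter off when creating the arrays, e.g. by passing shuffle=False to tables.Filters() and handing that to the array-creation call.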