Francesc,
Here's my setup:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 1.3
HDF5 version: 1.6.5
numarray version: 1.5.1
Zlib version: 1.2.1
BZIP2 version: 1.0.2 (30-Dec-2001)
Python version: 2.4.3 (#1, Apr 21 2006, 14:31:08)
[GCC 3.3.3 (SuSE Linux)]
Platform: linux2-x86_64
Byte-ordering: little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
I recently switched from 'h5import' to PyTables to convert the output from
large finite element models into HDF5 format. I like the PyTables approach
because it gives me more control than the shell scripts I cobbled together
to drive 'h5import'.
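For reference, the writer side boils down to something like this (a
from-memory sketch, not the real script; the group names, shapes, and
filter settings are taken from the ptdump output further down, and the
append loop is a placeholder):

    import numarray
    import tables

    # Sketch of how the 'new' file gets written; shapes and filter
    # settings match the ptdump output below.
    fileh = tables.openFile('xxx_lev_1_1.h5', mode='w')
    results = fileh.createGroup(fileh.root, 'results')
    oef1 = fileh.createGroup(results, 'oef1')

    # shuffle=1 is the setting I suspect; the old h5import file
    # was written without it.
    filters = tables.Filters(complevel=6, complib='zlib', shuffle=1)
    atom = tables.Float32Atom(shape=(0, 17759, 3), flavor='numarray')
    quad4 = fileh.createEArray(oef1, 'quad4', atom, title='',
                               filters=filters, expectedrows=1022)

    # Placeholder for the real FE-results extraction loop.
    for i in range(1022):
        row = numarray.zeros((1, 17759, 3), type=numarray.Float32)
        quad4.append(row)

    fileh.close()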
However, the most recent file takes much longer to search. Here are the
results of a simple test I ran against the old and new databases:
'New':
$ python test_finder.py
Found 3 results for your search
CQUAD4 1121910
fh.find('1121910') took 2.37 sec
Found 3 results for your search
fh.find('1121910', gpf=True) took 9.44 sec
'Old':
$ python test_finder.py
Found 3 results for your search
CQUAD4 1121910
fh.find('1121910') took 0.664 sec
Found 3 results for your search
fh.find('1121910', gpf=True) took 0.638 sec
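(For context, test_finder.py does little more than time calls like the
following; 'fh' is my own finder wrapper, so the find() signature here is
mine, not PyTables':)

    import time

    def timed_find(fh, eid, **kwargs):
        # 'fh' is my finder wrapper; find() scans the element arrays.
        t0 = time.time()
        hits = fh.find(eid, **kwargs)
        print "Found %d results for your search" % len(hits)
        print "fh.find(%r) took %.3g sec" % (eid, time.time() - t0)
        return hits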
The only difference I could detect between the two files is that the
PyTables version was written with the 'shuffle' filter. Here is some
ptdump output for the relevant nodes:
'New':
$ ptdump -v xxx_lev_1_1.h5:/results/oef1/quad4
/results/oef1/quad4 (EArray(1022L, 17759L, 3L), shuffle, zlib(6)) ''
atom = Atom(dtype='Float32', shape=(0, 17759L, 3L), flavor='numarray')
nrows = 1022
extdim = 0
flavor = 'numarray'
byteorder = 'little'
'Old':
$ ptdump -v xxx_lev_0.h5:/results/oef1/quad4
/cluster/stress/methods/local/lib/python2.4/site-packages/tables/File.py:227:
UserWarning: file ``xxx_lev_0.h5`` exists and it is an HDF5 file, but it
does not have a PyTables format; I will try to do my best to guess what's
there using HDF5 metadata
METADATA_CACHE_SIZE, nodeCacheSize)
/results/oef1/quad4 (EArray(1018L, 17402L, 3L), zlib(6)) ''
atom = Atom(dtype='Float32', shape=(0, 17402L, 3L), flavor='numarray')
nrows = 1018
extdim = 0
flavor = 'numarray'
byteorder = 'big'
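To double-check the filter pipeline and byte order programmatically, the
Leaf attributes should show the same thing (per the PyTables 1.x manual;
opening the old non-PyTables file will raise the same UserWarning but
still work):

    import tables

    for fname in ('xxx_lev_1_1.h5', 'xxx_lev_0.h5'):
        fileh = tables.openFile(fname, mode='r')
        node = fileh.getNode('/results/oef1/quad4')
        # Leaf exposes the HDF5 filter pipeline and byte order.
        print fname, node.filters, node.byteorder
        fileh.close()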
My client code was completely unchanged for this testing; only the
databases were created by two different methods. I have yet to do more
testing with smaller files (these are ~2.2 GB). I read the section on
shuffling in the manual, where it suggests that shuffle should actually
improve throughput, but this is the only difference I could detect. It is
not a trivial matter to produce these large files, so I need to get it
right. I know it's not much to go on, but any suggestions are appreciated.
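One experiment I intend to try before regenerating anything: copy the new
file with shuffle turned off and re-run the finder against the copy. If
copyFile accepts a filters override the way the manual describes,
something like this should do it:

    import tables

    # Re-filter the 'new' file with shuffle disabled, leaving
    # complevel/complib alone, then point test_finder.py at the copy.
    src = tables.openFile('xxx_lev_1_1.h5', mode='r')
    noshuffle = tables.Filters(complevel=6, complib='zlib', shuffle=0)
    src.copyFile('xxx_noshuffle.h5', filters=noshuffle, overwrite=True)
    src.close()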
Elias Collas
Stress Methods
Gulfstream Aerospace Corp