Thanks for your prompt reply.

[EMAIL PROTECTED] wrote on 08/24/2007 06:57:23 AM:

> Hi Elias,
> 
> On Thursday 23 August 2007, you wrote:
> > Francesc,
> >
> > Here's my setup:
> > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> > PyTables version:  1.3
> > HDF5 version:      1.6.5
> > numarray version:  1.5.1
> > Zlib version:      1.2.1
> > BZIP2 version:     1.0.2 (30-Dec-2001)
> > Python version:    2.4.3 (#1, Apr 21 2006, 14:31:08)
> > [GCC 3.3.3 (SuSE Linux)]
> > Platform:          linux2-x86_64
> > Byte-ordering:     little
                       ^^^^^^   <-- Notice this

Well, since my platform is little-endian and the problem file is also 
little-endian, I ignored this. I suspect 'h5import' was somehow creating 
big-endian files on my little-endian machine, but I have not verified 
this.
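
In case it's useful to anyone else, verifying it should be easy; here 
is a minimal sketch, assuming the PyTables 1.x API and a placeholder 
file name:

import tables

# Minimal sketch (PyTables 1.x API); 'problem.h5' is a placeholder name.
fileh = tables.openFile('problem.h5', mode='r')
for leaf in fileh.walkNodes('/', classname='Leaf'):
    # Leaf.byteorder reports the byte order of the data on disk.
    print leaf._v_pathname, leaf.byteorder
fileh.close()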

I tried using 'ptrepack' on the 'new' file to remove the shuffle filter 
and did not notice any improvement. On the other hand, when I created a 
test file *without* shuffle and then used ptrepack to *add* the shuffle 
filter, I did see some improvement (note that these test files are 
*much* smaller than the full-size one); the repack command is sketched 
after the timings below:

$ python test_finder.py
Testing file noshuffle.h5
GRID 121731
fh.find_gpfb('121731') took 1.73 sec
Found 3 results for your search
fh.find('1121910', gpf=True) took 0.192 sec

Testing file repacked.h5
GRID 121731
fh.find_gpfb('121731') took 0.989 sec
Found 3 results for your search
fh.find('1121910', gpf=True) took 0.0993 sec
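
For reference, re-adding the shuffle filter with ptrepack went roughly 
like this (file names are placeholders; I believe ptrepack's --shuffle 
flag takes 0 or 1):

$ ptrepack --overwrite --complevel=1 --complib=zlib --shuffle=1 \
    noshuffle.h5:/ repacked.h5:/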

Perhaps these cases are not stressful enough to draw conclusions about the 
shuffle filter for the full-size file. Also, I failed to mention that my 
'new' file was actually created with a file.copyNode() call after deleting 
and recreating a bad node (roughly as in the sketch below). I'm planning 
to rebuild this file, and I'll try it both with and without shuffle.
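
For the curious, the rebuild will look something like this; a rough 
sketch (PyTables 1.x API) with made-up file and node names:

import tables

# Rough sketch (PyTables 1.x API); file and node names are made up.
old = tables.openFile('old.h5', mode='a')
new = tables.openFile('new.h5', mode='w')
old.removeNode('/grids', 'bad_node')       # delete the bad node
# ... recreate '/grids/bad_node' from the original data here ...
# then copy the whole group into the fresh file
old.copyNode('/grids', newparent=new.root, recursive=True)
new.close()
old.close()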

As an aside, it seems that 'ptrepack' doesn't worry about byte order. I 
tried it on one of my 'old' big-endian files, and not only did it take 
forever to complete, it also corrupted most of the data; a byte-ordering 
issue, I assume.

> 
> So, you may want to try PyTables 2.0 or, if you want to stick with 1.3, 
> try disabling the shuffle filter (at the expense of reducing the 
> compression effectiveness) when creating the 'new' arrays.  My 
> recommendation, though, is for you to switch to 2.0, as there are more 
> optimizations (like using NumPy natively, among others) that can help 
> improve your times still further.
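
For the record, disabling shuffle at creation time would, I believe, 
look something like this; a sketch assuming the 1.x Filters/EArray API, 
with made-up names and shapes:

import tables

# Sketch (PyTables 1.x API); names and shapes are made up.
filters = tables.Filters(complevel=1, complib='zlib', shuffle=False)
fileh = tables.openFile('new.h5', mode='w')
atom = tables.Float64Atom(shape=(0,))   # enlargeable along the first dim
fileh.createEArray(fileh.root, 'grid', atom, "grid data", filters=filters)
fileh.close()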

Well, I have 2.0 built but not installed. My reluctance comes from not 
wanting to break my production codebase, so I have to proceed cautiously. 
However, this will definitely motivate me to upgrade!
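
To play it safe, I'll probably first install 2.0 into a private prefix 
and point PYTHONPATH at it, so the production install stays untouched; 
something along these lines (paths are just examples):

$ python setup.py install --prefix=$HOME/pt2test
$ PYTHONPATH=$HOME/pt2test/lib/python2.4/site-packages python test_finder.py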