On Friday 24 August 2007, you wrote:
> Thanks for your prompt reply.
>
> [EMAIL PROTECTED] wrote on 08/24/2007
> 06:57:23
>
> AM:
> > Hi Elias,
> >
> > On Thursday 23 August 2007, you wrote:
> > > Francesc,
> > >
> > > Here's my setup:
> > > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > > PyTables version:  1.3
> > > HDF5 version:      1.6.5
> > > numarray version:  1.5.1
> > > Zlib version:      1.2.1
> > > BZIP2 version:     1.0.2 (30-Dec-2001)
> > > Python version:    2.4.3 (#1, Apr 21 2006, 14:31:08)
> > > [GCC 3.3.3 (SuSE Linux)]
> > > Platform:          linux2-x86_64
> > > Byte-ordering:     little
>
>                        ^^^^^^   <--Notice this
>
> Well, since my platform and the problem file are both little-endian, I
> ignored this. I suspect 'h5import' was somehow creating big-endian
> files on my little-endian machine, but I have not verified this.
>
> I tried using 'ptrepack' on the 'new' file to remove the shuffle
> filter and did not notice any improvement. On the other hand, when I
> created a test file *without* shuffle and then used ptrepack to *add*
> the shuffle, I got some improvement (these are *much* smaller files):
>
> $ python test_finder.py
> Testing file noshuffle.h5
> GRID 121731
> fh.find_gpfb('121731') took 1.73 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 0.192 sec
>
> Testing file repacked.h5
> GRID 121731
> fh.find_gpfb('121731') took 0.989 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 0.0993 sec
>
> Perhaps these cases are not stressful enough to draw conclusions
> about the shuffle filter for the full size file. Also, I failed to
> mention that my 'New' file was actually created with a
> file.copyNode() call after deleting and recreating a bad node. I'm
> planning to rebuild this file and I'll try it both with and without
> shuffle.

Yeah, that's a bit strange.  If 're-adding' shuffle actually improves 
your search times, then perhaps it is not the real problem.
Now, I think the main issue could be the chunksize of the 'new' files.  
Can you run the 'h5ls -v' utility that comes with HDF5 and send 
the 'Chunks:' field of the output for the '/results/oef1/quad4' dataset 
in both the 'old' and 'new' files?
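For context, shuffle only changes compressibility, not correctness: it transposes the bytes of each element so that the (mostly zero) high-order bytes end up contiguous, which helps the compressor a lot.  A toy illustration in plain Python, with zlib standing in for HDF5's deflate (the element count and value range are made up for the example):

```python
import random
import struct
import zlib

# 10000 64-bit little-endian integers whose values fit in one byte: the
# low byte carries the data, the seven high bytes are always zero but
# are interleaved with it element by element.
random.seed(42)
raw = b''.join(struct.pack('<q', random.randrange(256))
               for _ in range(10000))

# Byte-shuffle, as the HDF5 shuffle filter does: gather byte 0 of every
# element, then byte 1, and so on, so the zero bytes form one long run.
shuffled = b''.join(raw[k::8] for k in range(8))

plain = len(zlib.compress(raw, 6))
shuf = len(zlib.compress(shuffled, 6))
print(shuf < plain)  # the shuffled stream compresses noticeably tighter
```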

Also, it would be nice to know how you are doing the search (sequential 
or sparse access?); if you can send the search algorithm, even better.  
The only thing that comes to mind is that, if your search is based on a 
sparse access pattern, a large chunksize can penalize the times badly; 
in that case, using PyTables 2.0, which chooses much smaller chunksizes 
by default, will help.  If you are using sequential access, then I 
don't really understand what the cause of the slowdown could be.
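To put a rough number on that penalty, here is a back-of-the-envelope model (the chunk and row sizes are made up for illustration, and real HDF5 chunk caching will soften the worst case):

```python
# Crude cost model (an assumption, not measured HDF5 behaviour): every
# sparse row hit forces the whole chunk holding it to be read from disk
# and decompressed.
def bytes_touched(n_hits, rows_per_chunk, row_size):
    # Worst case: each hit lands in a different chunk.
    return n_hits * rows_per_chunk * row_size

# 1000 scattered hits on 80-byte rows.
big = bytes_touched(1000, 16384, 80)   # large chunks
small = bytes_touched(1000, 256, 80)   # much smaller chunks
print(big // small)  # -> 64, i.e. 64x more I/O with the large chunks
```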

> As an aside, it seems that 'ptrepack' doesn't worry about byte order.
> I tried it on one of my 'old' big-endian files and not only did it
> take forever to complete, it corrupted most of the data (byte
> ordering, I assume).

Mmm, can you check whether the ptrepack that comes with PyTables 2.0 
shows the same bad behaviour?  If it does, I'd like to fix it.

> > So, you may want to try PyTables 2.0 or, if you want to stick with
> > 1.3, try disabling the shuffle filter (at the expense of reducing
> > the compression effectiveness) when creating the 'new' arrays.  My
> > recommendation, though, is that you switch to 2.0, as there are
> > more optimizations there (like using NumPy natively, among others)
> > that can help improve your times even more.
>
> Well, I have 2.0 built but not installed. I'm reluctant to break my
> production codebase, so I have to proceed cautiously. However, this
> will definitely motivate me to upgrade!

Well, if you have already built PyTables 2.0, you don't really need to 
install it in order to give it a try: just set the PYTHONPATH 
environment variable to point to where 2.0 lives, and you are done 
(well, if you are trying the utilities that come with PyTables 2.0, you 
should specify the complete path to reach them as well).
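The same trick can also be demonstrated from inside Python with sys.path, which is what PYTHONPATH feeds at interpreter startup (the directory and module name below are made-up stand-ins, deliberately not the real 'tables' package, so the sketch is self-contained):

```python
import os
import sys
import tempfile

# Stand-in for the uninstalled 2.0 build tree: a throwaway directory
# containing a dummy module.
build_dir = tempfile.mkdtemp()
with open(os.path.join(build_dir, 'tables_demo.py'), 'w') as fh:
    fh.write("__version__ = '2.0'\n")

# Prepending the directory to sys.path has the same effect as launching
# the interpreter with PYTHONPATH=build_dir: imports find it first.
sys.path.insert(0, build_dir)
import tables_demo
print(tables_demo.__version__)  # -> 2.0
```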

Hope that helps,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
