On Monday 27 August 2007, you wrote:
> > Yeah, that's a bit strange. If 're-adding' shuffle is actually
> > improving your search times, then perhaps it is not the actual
> > problem. Now, I think the main issue is likely the chunksize of the
> > 'new' files. Can you run the 'h5ls -v' utility that comes with HDF5
> > and send the 'Chunks:' fields of the output for the
> > '/results/oef1/quad4' dataset for both 'old' and 'new' files?
>
> $ h5ls -v old.h5/results/oef1/quad4
> Opened "old.h5" with sec2 driver.
> results/oef1/quad4 Dataset {1018/Inf, 17402/Inf, 3/3}
> Location: 0:1:0:28034319
> Links: 1
> Modified: 2007-01-04 15:45:37 EST
> Chunks: {119, 100, 3} 142800 bytes
> Storage: 212582832 logical bytes, 196302976 allocated bytes,
> 108.29% utilization
> Filter-0: deflate-1 OPT {6}
> Type: IEEE 32-bit big-endian float
> $ h5ls -v new.h5/results/oef1/quad4
> Opened "new.h5" with sec2 driver.
> results/oef1/quad4 Dataset {1022/Inf, 17759/17759, 3/3}
> Attribute: CLASS scalar
> Type: 7-byte null-terminated ASCII string
> Data: "EARRAY"
> Attribute: EXTDIM scalar
> Type: native int
> Data: 0
> Attribute: FLAVOR scalar
> Type: 9-byte null-terminated ASCII string
> Data: "numarray"
> Attribute: VERSION scalar
> Type: 4-byte null-terminated ASCII string
> Data: "1.3"
> Attribute: TITLE scalar
> Type: 1-byte null-terminated ASCII string
> Data: ""
> Location: 0:1:0:1126352
> Links: 1
> Modified: 2007-08-21 08:08:41 EDT
> Chunks: {1, 17759, 3} 213108 bytes
> Storage: 217796376 logical bytes, 183047210 allocated bytes,
> 118.98% utilization
> Filter-0: shuffle-2 OPT {4}
> Filter-1: deflate-1 OPT {6}
> Type: native float
>
> > Also, it would be nice to know how you are doing the search
> > (sequential or sparse access?); if you can send the search
> > algorithm, even better. The only thing that comes to mind is that,
> > if your search process is based on a sparse access pattern, a large
> > chunksize can heavily penalize search times; in that case, using
> > PyTables 2.0, which creates far smaller chunksizes by default, will
> > help. If you are using sequential access, then I don't really
> > understand what the cause of the slowdown could be.
>
> Well, the related arrays are stored in the same order. Then I use a
> simple binary search on an 'index' to determine the offset of the
> related data. For example, say that in a mesh, the index is a rank-1
> array of integer identifiers, and the associated space coordinates
> are stored as a rank-2 array, where the second dimension is like a
> tuple of (x, y, z).
Aha, so you are doing a binary search in an 'index' first; then it is
almost certain that most of the time is spent performing the lookup in
this rank-1 array. Since you are doing a binary search, and the
minimum unit of I/O in HDF5 is precisely one chunk, small chunksizes
will favor performance. Judging by your lookup times, my guess is that
your 'index' array is on disk, and that sparse access to it (i.e. the
binary search) is your bottleneck.
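To make the cost concrete, here is a minimal sketch of that kind of
lookup (the node names 'index' and 'coordinates' are illustrative, not
taken from your files; 'fileh' is assumed to be an open tables.File):

# Each probe reads at least one whole chunk from disk (modulo the
# HDF5 chunk cache), so a large chunkshape makes every step of the
# binary search expensive.
def find_offset(index_array, ident):
    lo, hi = 0, index_array.nrows
    while lo < hi:
        mid = (lo + hi) // 2
        if index_array[mid] < ident:  # one chunk read per probe
            lo = mid + 1
        else:
            hi = mid
    if lo < index_array.nrows and index_array[lo] == ident:
        return lo
    raise KeyError(ident)

offset = find_offset(fileh.root.index, 154092)
xyz = fileh.root.coordinates[offset]  # the associated (x, y, z)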
Unfortunately, you did not send the chunkshape for the rank-1 index
array, but most probably the chunksize in the 'old' files is rather
small compared with the 'new' ones. In that case, and as I said in
another message, creating the 'new' files with PyTables 2.0 will help,
because it uses far smaller chunksizes by default. Also, PyTables 2.0
lets you set even smaller chunksizes than the default (see the new
'chunkshape' parameter in the create*Array factories and the sketch
below), allowing finer tuning of query times.
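For instance, a minimal sketch of creating a rank-1 index array with an
explicitly small chunkshape (the file name, node name and filter
settings here are illustrative, not taken from your setup):

import numpy
import tables

fileh = tables.openFile('mesh.h5', mode='w')
filters = tables.Filters(complevel=6, complib='zlib', shuffle=True)
# A small chunkshape keeps each binary-search probe cheap: every
# probe reads one chunk of 1024 values instead of a much larger slab.
index = fileh.createEArray(fileh.root, 'index', tables.Int64Atom(),
                           shape=(0,), filters=filters,
                           chunkshape=(1024,))
index.append(numpy.arange(100000))  # your integer identifiers
fileh.close()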
As an aside, and just in case you are not aware of it: PyTables Pro
allows you to index columns of tables and then perform binary searches
on them very quickly. So, if you want maximum performance in your
lookups, one possibility is to declare a Table with a single column
(the identifiers), index it, and then do the query:
offset = [r.nrow for r in table.where('index == 154092')][0]
Of course, all the parameters in the Pro indexing engine have already
been fine-tuned so as to get pretty optimal query times (see [1] for a
detailed description of how Pro indexes work and their performance).
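In case it is useful, a minimal end-to-end sketch of that approach
follows (it assumes PyTables Pro 2.0; the file name, table name and
the 'identifiers' array are illustrative):

import numpy
import tables

class Identifiers(tables.IsDescription):
    index = tables.Int64Col()  # the integer identifiers

fileh = tables.openFile('lookup.h5', mode='w')
table = fileh.createTable(fileh.root, 'lookup', Identifiers)
identifiers = numpy.arange(200000)  # stand-in for your real ids
table.append([(i,) for i in identifiers])
table.flush()
table.cols.index.createIndex()  # requires PyTables Pro
# The row number of the match is the offset into the related arrays.
offset = [r.nrow for r in table.where('index == 154092')][0]
fileh.close()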
[snip]
> The new ptrepack seems to work OK. I did observe that if I used
> --complevel and --shuffle at the same time, shuffle was always set to
> "off" no matter the value of --shuffle.
This is a bug in ptrepack. The attached patch should solve the problem.
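With the patch applied, an invocation along these lines (file names
are illustrative) should honor an explicit --shuffle value:

$ ptrepack --complevel=6 --shuffle=1 old.h5:/ new.h5:/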
> Unfortunately, I can't test
> the effect of the new files:
>
> $ python test_finder.py
> Testing file /cluster/stress/p20loads/gac/lev_0_test.hdf5
> HDF5-DIAG: Error detected in HDF5 library version: 1.6.5 thread 0.
> Back trace follows.
> #000: H5A.c line 457 in H5Aopen_name(): attribute not found
> major(18): Attribute layer
> minor(05): Bad value
> #001: H5A.c line 404 in H5A_get_index(): attribute not found
> major(18): Attribute layer
> minor(48): Object not found
> Segmentation fault
>
> So I tried with PyTables 2.0:
> $ python test_finder.py
> Testing file /cluster/stress/p20loads/gac/lev_0_test.hdf5
> Traceback (most recent call last):
> File "test_finder.py", line 16, in ?
> fh.find_gpfb('121731')
> File "/cluster/stress/u308168/public_html/pyloads/model/finder.py",
> line 210, in find_gpfb
> r = nasob.NodalResult(self.fileh, g, balance=not oelop)
> File "../nasob.py", line 375, in __init__
> elements = grid.elements
> File "../nasob.py", line 52, in _elements
> self._elist.append(Result(self.fileh, eid, ogpf=True))
> File "../nasob.py", line 288, in __init__
> g.ogpf.T1 = g.ogpf.t1 = g.fx = g.FX = g.ogpf[:,0]
> AttributeError: 'numpy.ndarray' object has no attribute 'T1'
> Closing remaining open files:
> /cluster/stress/p20loads/gac/lev_0_test.hdf5... done
>
> I guess I'll have to read the migration docs ;)
Well, I think so ;)
[1] http://www.carabos.com/docs/OPSI-indexes.pdf
Cheers,
--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
Index: tables/scripts/ptrepack.py
===================================================================
--- tables/scripts/ptrepack.py (revision 3206)
+++ tables/scripts/ptrepack.py (working copy)
@@ -379,10 +379,11 @@
         filters = None
     else:
         if complevel is None: complevel = 0
-        if complevel > 0 and shuffle is None:
-            shuffle = True
-        else:
-            shuffle = False
+        if shuffle is None:
+            if complevel > 0:
+                shuffle = True
+            else:
+                shuffle = False
         if complib is None: complib = "zlib"
         if fletcher32 is None: fletcher32 = False
         filters = Filters(complevel=complevel, complib=complib,