On 12/5/12 7:55 PM, Alvaro Tejero Cantero wrote:
> My system was benched for reads and writes with Blosc[1]:
>
> with pt.openFile(paths.braw(block), 'r') as handle:
>     pt.setBloscMaxThreads(1)
>     %timeit  a = handle.root.raw.c042[:]
>     pt.setBloscMaxThreads(6)
>     %timeit  a = handle.root.raw.c042[:]
>     pt.setBloscMaxThreads(11)
>     %timeit  a = handle.root.raw.c042[:]
>     print handle.root.raw._v_attrs.FILTERS
>     print handle.root.raw.c042.__sizeof__()
>     print handle.root.raw.c042
>
> gives
>
> 1 loops, best of 3: 483 ms per loop
> 1 loops, best of 3: 782 ms per loop
> 1 loops, best of 3: 663 ms per loop
> Filters(complevel=5, complib='blosc', shuffle=True, fletcher32=False)
> 104
> /raw/c042 (CArray(303390000,), shuffle, blosc(5)) ''
>
> I can't understand what is going on, for the life of me. These 
> datasets use int16 atoms and at Blosc complevel=5 used to compress by 
> a factor of about 2. Even for such low compression ratios there should 
> be huge differences between single- and multi-threaded reads.
>
> Do you have any clue?

Yeah, welcome to the wonderful art of fine tuning.  Fortunately we
have a machine that is pretty much identical to yours (hey, your
computer did too well in the Blosc benchmarks to be ignored :), so I
can reproduce your issue:

In [3]: a = ((np.random.rand(3e8))*100).astype('i2')

In [4]: f = tb.openFile("test.h5", "w")

In [5]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, 
filters=tb.Filters(5, complib="blosc"))

In [6]: act[:] = a

In [7]: f.flush()

In [8]: ll test.h5
-rw-rw-r-- 1 faltet 301719914 Dec  6 04:55 test.h5

This random set of numbers is close to your array in size (~3e8 
elements), and also has a similar compression factor (~2x).  Now the 
timings (using 6 cores by default):

In [9]: timeit act[:]
1 loops, best of 3: 441 ms per loop

In [11]: tb.setBloscMaxThreads(1)
Out[11]: 6

In [12]: timeit act[:]
1 loops, best of 3: 347 ms per loop
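
By the way, it is worth having a look at the chunkshape that PyTables
selected automatically here.  A quick check (assuming the session
above is still live; the exact value depends on your PyTables
version):

print act.chunkshape                         # chunkshape picked by PyTables
print act.chunkshape[0] * act.atom.itemsize  # chunk size in bytes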


So yeah, those timings might seem a bit disappointing.  It turns out
that the default chunksize in PyTables is tuned to balance between
sequential and random reads.  If what you want is to optimize for
sequential reads only (apparently that is what you are after, right?),
then it normally helps to increase the chunksize.  For example, after
some quick trials I determined that a chunksize of 2 MB is pretty
optimal for sequential access:

In [44]: f.removeNode(f.root.act)

In [45]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, 
filters=tb.Filters(5, complib="blosc"), chunkshape=(2**20,))

In [46]: act[:] = a

In [47]: tb.setBloscMaxThreads(1)
Out[47]: 6

In [48]: timeit act[:]
1 loops, best of 3: 334 ms per loop

In [49]: tb.setBloscMaxThreads(3)
Out[49]: 1

In [50]: timeit act[:]
1 loops, best of 3: 298 ms per loop

In [51]: tb.setBloscMaxThreads(6)
Out[51]: 3

In [52]: timeit act[:]
1 loops, best of 3: 303 ms per loop
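
In case you want to play with other chunksizes, the arithmetic is
trivial; here is a tiny helper (nothing official, just a sketch; the
2 MB target is simply what worked well in the trials above):

# Convert a target chunk size in bytes into a 1-D chunkshape for a
# given atom (2 MB is just the value that worked well above).
import tables as tb

def chunkshape_for(atom, target_bytes=2*1024*1024):
    return (target_bytes // atom.itemsize,)

print chunkshape_for(tb.Int16Atom())   # -> (1048576,), i.e. (2**20,)

# and then:
# act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
#                      filters=tb.Filters(5, complib="blosc"),
#                      chunkshape=chunkshape_for(tb.Int16Atom()))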

Also, the timings above show that the sweet spot is 3 threads, not
more (don't ask me why).  However, that does not mean that Blosc
cannot go faster on this machine; in fact it can:

In [59]: import blosc

In [60]: sa = a.tostring()

In [61]: ac2 = blosc.compress(sa, 2, clevel=5)

In [62]: blosc.set_nthreads(6)
Out[62]: 6

In [64]: timeit a2 = blosc.decompress(ac2)
10 loops, best of 3: 80.7 ms per loop

In [65]: blosc.set_nthreads(1)
Out[65]: 6

In [66]: timeit a2 = blosc.decompress(ac2)
1 loops, best of 3: 249 ms per loop
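
(Not timed here, but as a sanity check you can get the decompressed
string back into a NumPy array; a minimal round trip, reusing a and
ac2 from the session above:)

# Reconstruct the int16 array from the decompressed string and verify
# the round trip (np.frombuffer wraps the string without copying).
import numpy as np

a2 = np.frombuffer(blosc.decompress(ac2), dtype='i2')
print (a2 == a).all()   # should print True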

So that means that a pure in-memory Blosc decompression can only go
about 4x faster than PyTables + Blosc, and in this case the latter is
reaching an excellent mark of ~2 GB/s, which is really good for a
read-from-disk operation.  Note that a memcpy()-like operation (a
plain array copy) on this machine takes just about as long:

In [36]: timeit a.copy()
1 loops, best of 3: 294 ms per loop
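
For reference, the ~2 GB/s figure above is just the raw size of the
array divided by the measured read time:

# Back-of-the-envelope throughput for the ~298 ms PyTables + Blosc read:
nbytes = 300000000 * 2          # 3e8 int16 elements -> 600 MB
print nbytes / 0.298 / 1e9      # ~2.0 GB/s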

Now that I'm at it, I'm curious about how other compressors perform
in this scenario:

In [6]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, 
filters=tb.Filters(5, complib="lzo"), chunkshape=(2**20,))

In [7]: act[:] = a

In [8]: f.flush()

In [9]: ll test.h5  # compression ratio very close to Blosc
-rw-rw-r-- 1 faltet 302769510 Dec  6 05:23 test.h5

In [10]: timeit act[:]
1 loops, best of 3: 1.13 s per loop

So LZO is more than 3x slower than Blosc here.  And a similar thing
happens with zlib:

In [12]: f.close()

In [13]: f = tb.openFile("test.h5", "w")

In [14]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, 
filters=tb.Filters(1, complib="zlib"), chunkshape=(2**20,))

In [15]: act[:] = a

In [16]: f.flush()

In [17]: ll test.h5     # the compression rate is somewhat better
-rw-rw-r-- 1 faltet 254821296 Dec  6 05:26 test.h5

In [18]: timeit act[:]
1 loops, best of 3: 2.24 s per loop

which is 6x slower than Blosc (although the compression ratio is a bit 
better).
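
In case anyone wants to reproduce this comparison, the pattern is
always the same; here is a rough sketch (it times with time.time()
instead of %timeit, so expect noisier numbers, and note that the zlib
run above used complevel=1, not 5):

# Same array, same 2 MB chunkshape, different compressors.
import time
import numpy as np
import tables as tb

a = (np.random.rand(int(3e8)) * 100).astype('i2')

for complib in ('blosc', 'lzo', 'zlib'):
    f = tb.openFile("test-%s.h5" % complib, "w")
    act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
                         filters=tb.Filters(5, complib=complib),
                         chunkshape=(2**20,))
    act[:] = a
    f.flush()
    t0 = time.time()
    act[:]                      # full sequential read
    print complib, "->", round(time.time() - t0, 3), "s"
    f.close()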

And just for the sake of completeness, let's see how fast carray (the
package, not the CArray object in PyTables) can go for a chunked
in-memory array:

In [19]: import carray as ca

In [20]: ac3 = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5))

In [21]: ac3
Out[21]:
carray((300000000,), int16)
   nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98
   cparams := cparams(clevel=5, shuffle=True)
[59 34 36 ..., 21 58 50]

In [22]: timeit ac3[:]
1 loops, best of 3: 254 ms per loop

In [23]: ca.set_nthreads(1)
Out[23]: 6

In [24]: timeit ac3[:]
1 loops, best of 3: 282 ms per loop

So, at 254 ms, it is only marginally faster than PyTables (~298 ms).
Now with a carray object on disk:

In [27]: acd = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5), 
rootdir="test")

In [28]: acd
Out[28]:
carray((300000000,), int16)
   nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98
   cparams := cparams(clevel=5, shuffle=True)
   rootdir := 'test'
[59 34 36 ..., 21 58 50]

In [30]: ca.set_nthreads(6)
Out[30]: 1

In [31]: timeit acd[:]
1 loops, best of 3: 317 ms per loop

In [32]: ca.set_nthreads(1)
Out[32]: 6

In [33]: timeit acd[:]
1 loops, best of 3: 361 ms per loop

The times in this case are a bit larger than with PyTables (317 ms vs
298 ms), which says a lot about how efficiently I/O is implemented in
the HDF5/PyTables stack.

-- 
Francesc Alted

