2011/12/30 Dav Clark <d...@alum.mit.edu>:
> On Dec 30, 2011, at 8:40 AM, Francesc Alted wrote:
>
>> 2011/12/30 Gael Varoquaux <gael.varoqu...@normalesup.org>:
>>> Hi list,
>>>
>>> I am trying to do a simple comparison of various I/O libraries to save a
>>> bunch of numpy arrays.  I don't have time to actually invest in PyTables
>>> now, but it has always been on my radar.  I wanted to get a ball-park
>>> estimate of what was achievable with PyTables in terms of read/write
>>> performance.  I wrote a quick pair of read and write functions, and I am
>>> getting really bad performance.
>>>
>>> Obviously, I should invest in learning PyTables, but right now I am just
>>> trying to get figures to justify such an investment.  Can somebody have a
>>> look at the following code to see if I haven't forgotten something
>>> obvious that would make I/O faster?  Sorry, I feel like I am asking you
>>> to do my work, but I hate that PyTables is coming out so badly on the
>>> benchmarks:
>> [clip]
>>
>> This depends a lot on the sort of arrays you are trying to save.  Do
>> they have the same shape and type?  Then it is best to save them in a
>> monolithic Array (or an EArray, if you want to use compression).
>>
>> If they have the same type but different shapes, then using a separate
>> entry in the same VLArray would be more effective.  In case the arrays
>> are large, it may be useful to use a high-performance compressor (e.g.
>> Blosc) so as to reduce their size.
>>
>> If your arrays do not share dtypes or shapes at all, then I'm afraid
>> this is the best performance you can expect from PyTables.  Is this that
>> bad compared with other options?
>
> What about compression?  I'm guessing you're comparing to .npz files,
> which would be compressed, but likely without the efficiency of Blosc.
> You'll probably get a modest net savings on write + read time.  See:
>
> http://pytables.github.com/usersguide/optimization.html
>
> for trade-offs in read and write speeds.
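A minimal sketch of the container choices described above, using the
PyTables 2.x API (file and node names are made up purely for illustration,
and the random data is only there to show the calls):

import numpy as np
import tables

# Some double-precision arrays, 100,000 elements each.
data = np.random.rand(100, 100000)

f = tables.openFile('containers.h5', mode='w')

# Same shape and same dtype: a single monolithic Array (no compression).
f.createArray(f.root, 'plain', data)

# Same shape/dtype but compressed: a chunked EArray with the Blosc filter.
filters = tables.Filters(complevel=9, complib='blosc')
ea = f.createEArray(f.root, 'compressed', tables.Float64Atom(),
                    shape=(0, data.shape[1]), filters=filters)
ea.append(data)

# Same dtype but different shapes: one row per array in a VLArray
# (rows are stored flattened, so 1-d arrays are the natural fit).
vl = f.createVLArray(f.root, 'ragged', tables.Float64Atom(), filters=filters)
for n in (10, 1000, 100000):
    vl.append(np.random.rand(n))

f.close()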
Yeah, it is true that compression can make a good difference in case your
datasets are compressible.  I've made some simple speed tests (see attached
script) to get an idea:

Time to write (A): 9.194
Time to read (A): 1.14
Time to write (CA, zlib): 21.939
Time to read (CA, zlib): 7.311
Time to write (EA, zlib): 31.954
Time to read (EA, zlib): 7.063
Time to write (CA, blosc): 4.016
Time to read (CA, blosc): 1.807
Time to write (EA, blosc): 13.930
Time to read (EA, blosc): 1.788
Time to write (VL): 9.399
Time to read (VL): 1.213
Time to write (VL, zlib): 60.983
Time to read (VL, zlib): 7.743
Time to write (VL, blosc): 3.658
Time to read (VL, blosc): 1.252
Time to write (VL, blosc2): 3.132
Time to read (VL, blosc2): 1.234

These are the timings (in seconds) for saving and retrieving 1000
double-precision arrays with 100,000 elements each, using different methods.
'A' is Gael's original method of using plain 'Array' objects.  'CA' and 'EA'
refer to using 'CArray' and 'EArray' objects, while 'VL' means using a
'VLArray'.  Looking above, it is clear that the best timings are obtained by
using a VLArray with Blosc ('blosc2' is a slight variation of the same
technique that uses the special `blosc.pack_array()` function, which has no
counterpart in the zlib package).

Here are the final sizes for the different methods:

-rw-rw-r-- 1 francesc francesc 764M Dec 30 23:48 test_a.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:49 test_ca_blosc.h5
-rw-rw-r-- 1 francesc francesc  71M Dec 30 23:48 test_ca_zlib.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:49 test_ea_blosc.h5
-rw-rw-r-- 1 francesc francesc  71M Dec 30 23:49 test_ea_zlib.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:51 test_vl_blosc2.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:51 test_vl_blosc.h5
-rw-rw-r-- 1 francesc francesc 764M Dec 30 23:49 test_vl.h5
-rw-rw-r-- 1 francesc francesc 547M Dec 30 23:50 test_vl_zlib.h5

Here, it is clear that the best compression ratio is obtained by using zlib
in combination with CArray/EArray, but at the expense of a lot of CPU.
Using Blosc allows a reasonable compression ratio while achieving pretty good
speed (around 250 MB/s for writing and 600 MB/s for reading, which is quite
impressive).  In addition, the VLArray/Blosc method is the most general, as
it allows saving arrays of different shapes and types.  So I'd say that Gael
would obtain his best results by using the `[write,read]_vl_blosc[2]_hdf()`
methods.

Furthermore, by fine-tuning the compression level for the VLArray/Blosc
method, I've been able to get these figures with a compression level of 6:

Time to write (VL, blosc): 1.89
Time to read (VL, blosc): 1.061

which corresponds to about 400 MB/s for writing and 720 MB/s for reading,
which is pretty impressive.  However, the compression ratio is degraded
somewhat:

-rw-rw-r-- 1 francesc francesc 172M Dec 30 23:45 test_vl_blosc.h5

i.e. 172 MB instead of the 132 MB achieved with Blosc level 9 (the maximum).
But hey, this is still a long way from the original 764 MB.

> I'll add that it would be cool to see what numbers you come up with (maybe
> with some loose specs on the machine CPU and disk you used).

Yeah, that would be useful indeed.  Please be aware that most of the results
above (especially the reading times) are (much) faster than disk speed
because the datasets fit in memory (24 GB on the testing machine).  If
Gael's datasets do not fit in memory, he will see much lower speeds.  At any
rate, I don't think PyTables would be the bottleneck in this case.

-- 
Francesc Alted
Attachment: gael.py
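The attached script is stored as binary data in the archive, so here is a
minimal sketch of the VLArray/Blosc route it benchmarks, using the PyTables
2.x API.  The function names simply mirror the `[write,read]_vl_blosc_hdf()`
ones mentioned above (this is not the attached code), and complevel=6
reflects the tuning discussed:

import numpy as np
import tables

def write_vl_blosc_hdf(filename, arrays, complevel=6):
    # One VLArray row per input array; Blosc handles the compression.
    f = tables.openFile(filename, mode='w')
    filters = tables.Filters(complevel=complevel, complib='blosc')
    vl = f.createVLArray(f.root, 'data', tables.Float64Atom(),
                         filters=filters)
    for arr in arrays:
        vl.append(arr.ravel())  # rows are 1-d; keep original shapes elsewhere
    f.close()

def read_vl_blosc_hdf(filename):
    f = tables.openFile(filename, mode='r')
    arrays = [row for row in f.root.data]  # each row is a numpy array
    f.close()
    return arrays

# Scaled-down version of the benchmark above: double-precision arrays of
# 100,000 elements each (use 1000 of them to reproduce the ~764 MB case).
arrays = [np.random.rand(100000) for _ in range(100)]
write_vl_blosc_hdf('test_vl_blosc.h5', arrays)
back = read_vl_blosc_hdf('test_vl_blosc.h5')
assert np.allclose(back[0], arrays[0])

# The 'blosc2' variant in the timings packs each array with
# blosc.pack_array() before storing it, instead of relying on the HDF5
# filter pipeline; that variant is not shown here.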