2011/12/30 Dav Clark <d...@alum.mit.edu>:
> On Dec 30, 2011, at 8:40 AM, Francesc Alted wrote:
>
>> 2011/12/30 Gael Varoquaux <gael.varoqu...@normalesup.org>:
>>> Hi list,
>>>
>>> I am trying to do a simple comparison of various I/O libraries to save a
>>> bunch of numpy arrays.  I don't have time to actually invest in PyTables
>>> now, but it has always been on my radar.  I wanted to get a ball-park
>>> estimate of what was achievable with PyTables in terms of read/write
>>> performance.  I wrote a quick pair of read and write functions, and I am
>>> getting really bad performance.
>>>
>>> Obviously, I should invest in learning PyTables, but right now I am just
>>> trying to get figures to justify such an investment.  Can somebody have a
>>> look at the following code to see if I haven't forgotten something
>>> obvious that would make I/O faster?  Sorry, I feel like I am asking you
>>> to do my work, but I hate that PyTables is coming out so badly on the
>>> benchmarks:
>> [clip]
>>
>> This depends a lot on the sort of arrays you are trying to save.  Do
>> they have the same shape and type?  Then it is best to save them in a
>> monolithic Array (or an EArray, if you want to use compression).
>>
>> If they have the same type but different shapes, then using a separate
>> entry in the same VLArray would be more effective.  In case the arrays
>> are large, it may be useful to use a high-performance compressor (e.g.
>> Blosc) so as to reduce their size.
>>
>> If your arrays do not share dtypes or shapes at all, then I'm afraid
>> this is the best performance you can expect from PyTables.  Is this that
>> bad compared with other options?
>
> What about compression?  I'm guessing you're comparing to .npz files,
> which would be compressed, but likely without the efficiency of Blosc.
> You'll probably get a modest net savings on write + read time.  See:
>
> http://pytables.github.com/usersguide/optimization.html
>
> for trade-offs in read and write speeds.
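A minimal sketch of the container choices described above, using the
PyTables 2.x API (file and node names are made up purely for illustration,
and the random data is only there to show the calls):

import numpy as np
import tables

# Some double-precision arrays, 100,000 elements each.
data = np.random.rand(100, 100000)

f = tables.openFile('containers.h5', mode='w')

# Same shape and same dtype: a single monolithic Array (no compression).
f.createArray(f.root, 'plain', data)

# Same shape/dtype but compressed: a chunked EArray with the Blosc filter.
filters = tables.Filters(complevel=9, complib='blosc')
ea = f.createEArray(f.root, 'compressed', tables.Float64Atom(),
                    shape=(0, data.shape[1]), filters=filters)
ea.append(data)

# Same dtype but different shapes: one row per array in a VLArray
# (rows are stored flattened, so 1-d arrays are the natural fit).
vl = f.createVLArray(f.root, 'ragged', tables.Float64Atom(), filters=filters)
for n in (10, 1000, 100000):
    vl.append(np.random.rand(n))

f.close()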
Yeah, it is true that compression can make a good difference in case your
datasets are compressible.  I've made some simple speed tests (see attached
script) to get an idea:

Time to write (A): 9.194
Time to read (A): 1.14
Time to write (CA, zlib): 21.939
Time to read (CA, zlib): 7.311
Time to write (EA, zlib): 31.954
Time to read (EA, zlib): 7.063
Time to write (CA, blosc): 4.016
Time to read (CA, blosc): 1.807
Time to write (EA, blosc): 13.930
Time to read (EA, blosc): 1.788
Time to write (VL): 9.399
Time to read (VL): 1.213
Time to write (VL, zlib): 60.983
Time to read (VL, zlib): 7.743
Time to write (VL, blosc): 3.658
Time to read (VL, blosc): 1.252
Time to write (VL, blosc2): 3.132
Time to read (VL, blosc2): 1.234

These are the timings (in seconds) for saving and retrieving 1000
double-precision arrays with 100,000 elements each, using different methods.
'A' is Gael's original method of using plain 'Array' objects.  'CA' and 'EA'
refer to using 'CArray' and 'EArray' objects, while 'VL' means using a
'VLArray'.  Looking above, it is clear that the best timings are obtained by
using a VLArray with Blosc ('blosc2' is a slight variation of the same
technique that uses the special `blosc.pack_array()` function, which has no
counterpart in the zlib package).

Here are the final sizes for the different methods:

-rw-rw-r-- 1 francesc francesc 764M Dec 30 23:48 test_a.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:49 test_ca_blosc.h5
-rw-rw-r-- 1 francesc francesc  71M Dec 30 23:48 test_ca_zlib.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:49 test_ea_blosc.h5
-rw-rw-r-- 1 francesc francesc  71M Dec 30 23:49 test_ea_zlib.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:51 test_vl_blosc2.h5
-rw-rw-r-- 1 francesc francesc 132M Dec 30 23:51 test_vl_blosc.h5
-rw-rw-r-- 1 francesc francesc 764M Dec 30 23:49 test_vl.h5
-rw-rw-r-- 1 francesc francesc 547M Dec 30 23:50 test_vl_zlib.h5

Here, it is clear that the best compression ratio is obtained by using zlib
in combination with CArray/EArray, but at the expense of a lot of CPU.
Using Blosc allows a reasonable compression ratio while achieving pretty good
speed (around 250 MB/s for writing and 600 MB/s for reading, which is quite
impressive).  In addition, the VLArray/Blosc method is the most general, as
it allows saving arrays of different shapes and types.  So I'd say that Gael
would obtain his best results by using the `[write,read]_vl_blosc[2]_hdf()`
methods.

Furthermore, by fine-tuning the compression level for the VLArray/Blosc
method, I've been able to get these figures with a compression level of 6:

Time to write (VL, blosc): 1.89
Time to read (VL, blosc): 1.061

which corresponds to about 400 MB/s for writing and 720 MB/s for reading,
which is pretty impressive.  However, the compression ratio is degraded
somewhat:

-rw-rw-r-- 1 francesc francesc 172M Dec 30 23:45 test_vl_blosc.h5

i.e. 172 MB instead of the 132 MB achieved with Blosc level 9 (the maximum).
But hey, this is still a long way from the original 764 MB.

> I'll add that it would be cool to see what numbers you come up with (maybe
> with some loose specs on the machine CPU and disk you used).

Yeah, that would be useful indeed.  Please be aware that most of the results
above (especially the reading times) are (much) faster than disk speed
because the datasets fit in memory (24 GB on the testing machine).  If
Gael's datasets do not fit in memory, he will see much lower speeds.  At any
rate, I don't think PyTables would be the bottleneck in this case.

-- 
Francesc Alted
Attachment: gael.py
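The attached script is stored as binary data in the archive, so here is a
minimal sketch of the VLArray/Blosc route it benchmarks, using the PyTables
2.x API.  The function names simply mirror the `[write,read]_vl_blosc_hdf()`
ones mentioned above (this is not the attached code), and complevel=6
reflects the tuning discussed:

import numpy as np
import tables

def write_vl_blosc_hdf(filename, arrays, complevel=6):
    # One VLArray row per input array; Blosc handles the compression.
    f = tables.openFile(filename, mode='w')
    filters = tables.Filters(complevel=complevel, complib='blosc')
    vl = f.createVLArray(f.root, 'data', tables.Float64Atom(),
                         filters=filters)
    for arr in arrays:
        vl.append(arr.ravel())  # rows are 1-d; keep original shapes elsewhere
    f.close()

def read_vl_blosc_hdf(filename):
    f = tables.openFile(filename, mode='r')
    arrays = [row for row in f.root.data]  # each row is a numpy array
    f.close()
    return arrays

# Scaled-down version of the benchmark above: double-precision arrays of
# 100,000 elements each (use 1000 of them to reproduce the ~764 MB case).
arrays = [np.random.rand(100000) for _ in range(100)]
write_vl_blosc_hdf('test_vl_blosc.h5', arrays)
back = read_vl_blosc_hdf('test_vl_blosc.h5')
assert np.allclose(back[0], arrays[0])

# The 'blosc2' variant in the timings packs each array with
# blosc.pack_array() before storing it, instead of relying on the HDF5
# filter pipeline; that variant is not shown here.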