On Feb 10, 2012, at 11:40 AM, Francesc Alted wrote:

> Apparently sent from an unsubscribed address.
>
> Begin forwarded message:
>
>> From: pytables-users-boun...@lists.sourceforge.net
>> Subject: Auto-discard notification
>> Date: February 10, 2012 12:14:45 AM GMT+01:00
>> To: pytables-users-ow...@lists.sourceforge.net
>>
>> The attached message has been automatically discarded.
>>
>> From: Jimmy Paillet <jimmy.pail...@gmail.com>
>> Subject: Performance advice/Blosc
>> Date: February 10, 2012 12:14:19 AM GMT+01:00
>> To: pytables-users@lists.sourceforge.net
>>
>> Hey,
>>
>> I'd like to ask some advice about PyTables data organization and compression performance.
>>
>> My data set is just one big table (500 Mrows, 45 columns); the file size is 70 GB, compressed with blosc-4, and the compression ratio is around 2-3. There are several ultralight indexes.
>> Python 2.5, PyTables 2.3.1, Ubuntu 8.04 64-bit, 4-core Intel Xeon, 12 GB RAM.
>>
>> The file is on a NAS that I reach over a GbE link. Performance was not that bad for a maximum I/O bandwidth of 90 MB/s.
>>
>> To see how it would scale with I/O speed, I set up a 3-SSD RAID 0 (sequential reads up to 660 MB/s), and I got a bit disappointed. Yes, very selective queries that can use indexes are much faster on the RAID (up to 6 times). However, broader queries are almost on par with the speed I got from the NAS, which seems weird since they come close to sequential reads. These are the queries I wanted to speed up!
>>
>> It seems I can't get past 80-90 MB/s when reading a compressed HDF5 file. It's roughly the same with LZO or Blosc (except that LZO compresses about twice as much). Does that number seem reasonable? Reading about LZO and especially Blosc on the web, it looks a bit underwhelming in comparison. Am I missing something?
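For reference, a minimal sketch of the kind of setup described above, using the PyTables 2.x API from the post. The schema, file name, and indexed columns are invented; only the Blosc level-4 filter and the "ultralight" index kind come from the message:

    import numpy as np
    import tables

    # Hypothetical 45-column schema: with float64 everywhere the row size is
    # 45 * 8 = 360 bytes, which matters for the reply below.
    desc = np.dtype([('c%02d' % i, np.float64) for i in range(45)])

    # "blosc-4", i.e. Blosc at compression level 4, as described in the post.
    filters = tables.Filters(complevel=4, complib='blosc')

    h5 = tables.openFile('bigtable.h5', mode='w')          # PyTables 2.x API
    tbl = h5.createTable('/', 'data', desc, filters=filters,
                         expectedrows=500 * 1000 * 1000)
    print(tbl.rowsize)                                      # 360 for this made-up schema

    # A couple of "ultralight" indexes on hypothetical query columns.
    tbl.cols.c00.createIndex(kind='ultralight')
    tbl.cols.c01.createIndex(kind='ultralight')
    h5.close()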
The problem here is probably related to the size of your records, which is likely larger than 256 bytes; for such sizes the shuffle filter becomes inactive (I have found that above that size it tends to introduce too large a performance penalty). If you want to experiment yourself, you can raise the value of BLOSC_MAX_TYPESIZE in:

https://github.com/PyTables/PyTables/blob/master/blosc/blosc.h#L42

and recompile. I'm guessing here, but I don't think this should bring other problems. At any rate, if you do so, I'm curious about the results, so please post them here. The best solution would be to implement column-wise tables (the current ones in PyTables are row-wise).

>> One of my issues, I believe, is that I can't get more than one decompressing Blosc thread, even though I set tables.setBloscMaxThreads(6).
>> Any ideas of what is happening here?

Hmm, that should work. What makes you say it does not?
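To make the column-wise idea concrete, here is a rough sketch of a manual workaround (this is not an existing PyTables feature; the column names, row count, and chunk shape are invented): store the columns that your broad queries scan as separate Blosc-compressed CArrays. Each array then has an 8-byte typesize, well below BLOSC_MAX_TYPESIZE, so shuffle stays active, and a scan only reads the bytes of the columns it actually needs.

    import tables

    filters = tables.Filters(complevel=4, complib='blosc', shuffle=True)
    nrows = 500 * 1000 * 1000

    h5 = tables.openFile('columns.h5', mode='w')
    for name in ('price', 'volume', 'flag'):               # hypothetical columns
        h5.createCArray('/', name, tables.Float64Atom(), shape=(nrows,),
                        filters=filters, chunkshape=(256 * 1024,))
        # ... then fill each array in large slices, e.g.
        # h5.getNode('/' + name)[i:j] = block
    h5.close()

    # When reading back, let Blosc decompress with several threads and
    # scan a single column in large slabs.
    tables.setBloscMaxThreads(6)
    h5 = tables.openFile('columns.h5', mode='r')
    slab = h5.root.price[0:10 * 1000 * 1000]               # one ~80 MB read
    h5.close()

The trade-off is that appending a row turns into one write per column array and you lose Table queries and indexes, so this only pays off when the workload is dominated by full scans of a few columns.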