On Feb 10, 2012, at 11:40 AM, Francesc Alted wrote:

> Apparently sent from an unsubscribed address.
> 
> Begin forwarded message:
> 
>> From: pytables-users-boun...@lists.sourceforge.net
>> Subject: Auto-discard notification
>> Date: February 10, 2012 12:14:45 AM GMT+01:00
>> To: pytables-users-ow...@lists.sourceforge.net
>> 
>> The attached message has been automatically discarded.
>> From: Jimmy Paillet <jimmy.pail...@gmail.com>
>> Subject: Performance advice/Blosc
>> Date: February 10, 2012 12:14:19 AM GMT+01:00
>> To: pytables-users@lists.sourceforge.net
>> 
>> 
>> Hey,
>> 
>> I'd like to ask for some advice about PyTables data organization and
>> compression performance.
>> 
>> My data set is just a big table (500M rows, 45 columns); the file is 70 GB,
>> compressed with Blosc at level 4, and the compression ratio is around 2-3.
>> There are several ultralight indexes.
>> Python 2.5, PyTables 2.3.1, Ubuntu 8.04 64-bit, 4-core Intel Xeon, 12 GB RAM.
>> 
>> The file is on a NAS that I am connected to over a GbE link.
>> Performance was not that bad for a maximum I/O bandwidth of 90 MB/s.
>> 
>> To see how it would scale with I/O speed, I set up a 3-SSD RAID 0
>> (sequential read speeds up to 660 MB/s), and I was a bit disappointed.
>> Yes, very selective queries that can use the indexes are much faster on the
>> RAID (up to 6 times). However, broader queries are almost on par with the
>> speed I got from the NAS, which seemed weird since their access pattern is
>> close to a sequential read. These are the queries I wanted to speed up!
>> 
>> It seems I can't get past 80-90 MB/s when reading a compressed HDF5 file.
>> It's roughly the same with LZO or Blosc (except that LZO compresses about
>> twice as much).
>> Does that number seem reasonable? Compared with what I read on the web about
>> LZO and especially Blosc, it looks a bit underwhelming.
>> Am I missing something?

The problem here is probably related to the size of your records, which is 
likely larger than 256 bytes.  For such sizes the shuffle filter is disabled 
(I have found that it tends to introduce too much of a performance penalty 
beyond that point).
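
If you want to check this, the row size is easy to inspect from Python.  Here 
is a minimal sketch (the file name and node path are placeholders; rowsize and 
filters are the Table/Leaf attributes, if I remember the 2.x API correctly):

import tables

# Minimal sketch: inspect the table's row size to see whether it exceeds
# the ~256-byte limit above which Blosc skips the shuffle filter.
# 'data.h5' and '/mytable' are placeholders for your actual file and node.
f = tables.openFile('data.h5', mode='r')
tbl = f.getNode('/mytable')

print "row size (bytes):", tbl.rowsize
print "filters in use:  ", tbl.filters    # shows complib and complevel

f.close()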

If you want to experiment yourself, you can raise the value of 
BLOSC_MAX_TYPESIZE in:

https://github.com/PyTables/PyTables/blob/master/blosc/blosc.h#L42

and recompile.  I'm guessing here, but I don't think this should cause other 
problems.  At any rate, if you do so, I'm curious about the results, so please 
post them here.
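
If it helps for comparing before and after the recompile, a crude way to 
measure the effective read throughput is to time a full sequential scan and 
divide the compressed file size by the elapsed time.  A sketch (again, the 
file name and node path are placeholders):

import os
import time
import tables

# Crude throughput measurement: time a full sequential scan of the table
# and report MB/s relative to the compressed on-disk size.
fname = 'data.h5'                  # placeholder
f = tables.openFile(fname, mode='r')
tbl = f.getNode('/mytable')        # placeholder

t0 = time.time()
nrows = 0
for row in tbl:                    # forces every chunk to be decompressed
    nrows += 1
elapsed = time.time() - t0

mb = os.path.getsize(fname) / (1024.0 * 1024.0)
print "%d rows in %.1f s -> %.1f MB/s (compressed)" % (nrows, elapsed, mb / elapsed)

f.close()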

The best solution is to implement column-wise tables (the current ones in 
PyTables are row-wise).
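
In the meantime, a possible workaround on your side (just a sketch, not 
something PyTables does for you) is to store the columns that your broad 
queries actually touch as separate compressed arrays, so a scan only has to 
read and decompress those columns:

import tables

# Sketch of a column-wise layout: one EArray per column, each with its own
# Blosc filter, so a query over two columns does not decompress the other 40+.
# Names, dtypes and values are made up for illustration.
filters = tables.Filters(complevel=4, complib='blosc')
f = tables.openFile('columns.h5', mode='w')
colA = f.createEArray(f.root, 'colA', tables.Float64Atom(), (0,), filters=filters)
colB = f.createEArray(f.root, 'colB', tables.Int32Atom(), (0,), filters=filters)

colA.append([1.0, 2.0, 3.0])
colB.append([10, 20, 30])

# A "query" touching only these two columns reads just these two arrays:
a = colA.read()
b = colB.read()
print [x for x, y in zip(a, b) if y > 15]

f.close()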

>> One of my issues, I believe, is that I can't get more than one Blosc
>> decompression thread, even though I set tables.setBloscMaxThreads(6).
>> Any idea what is happening here?

Hmm, that should work.  Why do you say that it does not?
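
One thing you can check: if I remember correctly, setBloscMaxThreads() returns 
the previous setting, so calling it twice will at least tell you whether the 
new value is being registered before you run the query:

import tables

# Sketch: verify that the Blosc thread setting sticks.  If memory serves,
# setBloscMaxThreads() returns the previous value.
old = tables.setBloscMaxThreads(6)
print "previous Blosc max threads:", old
print "current Blosc max threads: ", tables.setBloscMaxThreads(6)  # expected: 6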

>> 
>> On uncompressed files, I can reach the 600 MB/s limit when doing reads.  But
>> since the files are 2 to 6 times bigger, I often end up with similar overall
>> performance.  So I wonder how to scale my system.
>> 
>> Thanks for any input.
>> J.

-- Francesc Alted