Hi Elias,

On Tuesday 04 September 2007, [EMAIL PROTECTED] wrote:
> Hello again,
>
> This topic is of great interest to me as I have been attempting to
> tune the chunkshape parameter manually.
>
> After our last exchange, I took your suggestions and made all my
> index searches in-memory to get max speed. What I found was initially
> very surprising, but on reflection started to make sense: I actually
> had a greater bottleneck due to how I organized my data vs. how it
> was being used. To wit, I had a multidimensional array with a shape
> like this:
>
> {1020, 4, 15678, 3}
>
> but I was reading it -- with PyTables -- like so:
> >>> data = earrayObject[:,:,offset,:]
>
> With small arrays like {20, 4, 15678, 3} it is not so noticeable, but
> with the combination of large arrays and the default chunkshape, a
> lot of time was being spent slicing the array.

Mmmm, what do you mean by your 'default' chunkshape?  Your application 
chunkshape or a PyTables automatic chunkshape?  You don't say which 
'default' chunkshape you are using, but, in your example above, and 
for your kind of access pattern, a pretty optimal chunkshape would be 
{20, 4, 1, 3}, because you only need to read one element of the third 
dimension on each access, avoiding further unnecessary 
reads/decompressions.  However, a chunksize in the third dimension 
moderately larger than 1 could represent a good I/O balance.  
See below.
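
Just to illustrate, in PyTables 2.0 you can pass the chunkshape 
explicitly at creation time.  A minimal sketch (the file and node 
names, and the choice of the first dimension as the extensible one, 
are just assumptions of mine):

import tables

f = tables.openFile('layout.h5', mode='w')
filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
# Assumed: first dimension is the extensible one; chunkshape is tuned
# for reading data[:, :, offset, :], i.e. one element of the third
# dimension per access.
ea = f.createEArray(f.root, 'data', tables.Float32Atom(),
                    (0, 4, 15678, 3), "example", filters=filters,
                    expectedrows=1020, chunkshape=(20, 4, 1, 3))
print ea.chunkshape     # --> (20, 4, 1, 3)
f.close()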

> With the switch to PyTables (from h5import) I was able to easily
> reorganize the data to be more efficient for how I was reading it,
> i.e.,
>
> >>> earrayObject.shape
>
> (15678L, 4L, 1020L, 3L)
>
> >>> data = earrayObject[offset,:,:,:]

In PyTables 2.0 you could also set the third dimension as the main 
one, and the chunkshape will be computed optimally (I mean, for 
sparse access along the main dim and reasonably fast appends).
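
For example (just a sketch with made-up names), declaring the third 
dimension as the main one is a matter of putting the 0 there and 
letting PyTables compute the chunkshape for you:

import tables

f = tables.openFile('reorg.h5', mode='w')
# Assumed: keep your original layout, but make the third dimension
# the extensible (main) one.  With no explicit chunkshape, PyTables
# 2.0 computes one aimed at sparse reads along the main dim and
# reasonably fast appends.
ea = f.createEArray(f.root, 'data', tables.Float32Atom(),
                    (1020, 4, 0, 3), "example", expectedrows=15678)
print ea.chunkshape     # automatically computed
f.close()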

> It seems to me then, that chunkshape could be selected to also give
> optimal, or near-optimal performance. My problem now is that as I
> make the chunks smaller, I get better read performance (which is the
> goal), but write performance (not done very often) has slowed way
> down. I suppose this makes sense, as smaller chunks implies more
> trips to the disk for I/O writing the entire array.

That's correct.

> So are there any guidelines to balance reading vs writing performance
> with chunkshape? Right now I'm just trying 'sensible' chunkshapes and
> seeing what the result is. Currently, I'm leaning toward something
> like (32, 4, 256, 3). The truth is, only one row is ever read at a
> time, but the write time for (1, 4, 512, 3) is just too long. Is
> there an obvious flaw in my approach that I cannot see?

Not so obvious, because an optimal chunkshape depends largely on your 
access pattern and whether you want to optimize reads, writes or get a 
fair balance between them.  So, your mileage may vary.

As a tip, it is always good to write a small benchmark and find the 
best parameters for your case (I know that this takes time, and if 
you were to write this in plain C, perhaps you would think twice 
about doing it, but hey, you are using Python! ;).  As an example, 
I've made such a benchmark that times read/write operations in a 
scenario similar to yours (see attached script).
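
The attached script is the authoritative version, but the gist of it 
is along these lines (the shape, dtype and sizes here are simplified 
placeholders of mine, not necessarily the ones used in the script):

import time, random
import numpy
import tables

N = 1000                          # length of the main (first) dimension
filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
data = numpy.zeros((N, 4, 128, 3), dtype='float32')   # toy data

f = tables.openFile('bench.h5', mode='w')
for csize in (1, 5, 10):          # chunksize along the main dimension
    ea = f.createEArray(f.root, 'e%d' % csize, tables.Float32Atom(),
                        (0, 4, 128, 3), filters=filters,
                        expectedrows=N, chunkshape=(csize, 4, 128, 3))
    t0 = time.time()
    ea.append(data)               # sequential write
    print 'e%d. Time took for writing: %.3f' % (csize, time.time() - t0)

    t0 = time.time()
    for i in range(1000):         # random sparse reads along the main dim
        row = ea[random.randrange(N), :, :, :]
    print 'e%d. Time took for 1000 reads: %.3f' % (csize, time.time() - t0)
f.close()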

This benchmark selects a chunksize of 1 (labeled as 'e1'), 5 ('e5') 
and 10 ('e10') for the main dimension and measures the times for 
doing a sequential write and random sparse reads (along the main 
dimension).  Here are the results when using the zlib (plus shuffle) 
compressor:

************** Writes ************
e1. Time took for writing: 7.567
e5. Time took for writing: 2.361
e10. Time took for writing: 1.873
************** Reads *************
e1. Time took for 1000 reads: 0.588
e5. Time took for 1000 reads: 0.669
e10. Time took for 1000 reads: 0.755

So, using a chunksize of 1 in the main dim is optimal for random 
reads (as expected), but it takes a lot of time for writes.  A size 
of 10 offers the best writing times but poor read times.  In this 
case, 5 seems to represent a reasonably good balance between writes 
and reads.

If you want better speed but still want to keep using compression, 
the LZO compressor performs very well in this scenario.  Here are 
the times for LZO (plus shuffle):

************** Writes ************
e1. Time took for writing: 4.847
e5. Time took for writing: 1.602
e10. Time took for writing: 1.281
************** Reads *************
e1. Time took for 1000 reads: 0.532
e5. Time took for 1000 reads: 0.568
e10. Time took for 1000 reads: 0.611

which represents up to a 50% speed-up for writes and up to 18% 
faster sparse reads.
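
For reference, switching compressors (or disabling compression) is 
just a matter of the Filters instance you pass at creation time (the 
complevel=1 here is only an example value):

import tables

zlib_filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
lzo_filters  = tables.Filters(complevel=1, complib='lzo', shuffle=True)
no_filters   = tables.Filters(complevel=0)    # compression disabled

# e.g. (hypothetical node):
# ea = f.createEArray(f.root, 'e5', atom, shape, filters=lzo_filters, ...)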

Finally, removing compression completely might seem the best bet for 
optimizing reads, but this can get tricky (and it actually does).  
The times when disabling compression are:

************** Writes ************
e1. Time took for writing: 4.337
e5. Time took for writing: 1.428
e10. Time took for writing: 1.076
************** Reads *************
e1. Time took for 1000 reads: 0.751
e5. Time took for 1000 reads: 2.979
e10. Time took for 1000 reads: 0.605

i.e. for writes there is a clear win, but reads generally perform 
more slowly (especially for chunksize 5, which is extremely slow; I 
don't know exactly why).

> Also, should I avoid ptrepack, or is there a switch that will
> preserve my carefully chosen chunkshapes? I have the same situation
> as Gabriel in that I don't know what the final number of rows of my
> EArray will be (it's now the third dimension that is the extensible
> axis) and I just take the default, expectedrows=1000.

Well, if you want to preserve your carefully tuned chunkshape, then 
you shouldn't use ptrepack, as it is meant to re-calculate the 
chunkshape in order to adapt it to general use, which may not 
coincide with your specific needs (as is generally the case when you 
have extremely fine-tuned chunkshape parameters).

Mmm, I'm thinking that perhaps adding a 'chunkshape' argument to 
Leaf.copy() would be a good thing for those users who want to 
explicitly set their own chunkshape on the destination leaf. I'll add 
a ticket so that we don't forget about this.
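
In the meantime, a possible workaround (just an untested sketch with 
made-up names and sizes) is to create the destination EArray yourself 
with the chunkshape you want and append the data in slices along the 
extensible dimension:

import tables

fin = tables.openFile('src.h5', mode='r')
fout = tables.openFile('dst.h5', mode='w')
src = fin.root.data                       # hypothetical source node

dst = fout.createEArray(fout.root, 'data', src.atom,
                        (15678, 4, 0, 3),        # 0 on the extensible dim
                        filters=src.filters,
                        expectedrows=src.shape[2],
                        chunkshape=(32, 4, 256, 3))  # your tuned chunkshape

step = 64            # slices along the extensible dim; adjust to your RAM
for start in range(0, src.shape[2], step):
    dst.append(src[:, :, start:start+step, :])

fin.close()
fout.close()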

Hope that helps,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

Attachment: prova.py
Description: application/python
