Hi Elias,

On Tuesday 04 September 2007, [EMAIL PROTECTED] wrote:
> Hello again,
>
> This topic is of great interest to me as I have been attempting to
> tune the chunkshape parameter manually.
>
> After our last exchange, I took your suggestions and made all my
> index searches in-memory to get max speed. What I found was initially
> very surprising, but on reflection started to make sense: I actually
> had a greater bottleneck due to how I organized my data vs. how it
> was being used. To wit, I had a multidimensional array with a shape
> like this:
>
> {1020, 4, 15678, 3}
>
> but I was reading it -- with PyTables -- like so:
>
> >>> data = earrayObject[:,:,offset,:]
>
> With small arrays like {20, 4, 15678, 3} it is not so noticeable, but
> with the combination of large arrays and the default chunkshape, a
> lot of time was being spent slicing the array.

Mmmm, what do you mean by your 'default' chunkshape?  Your application
chunkshape or a PyTables automatic chunkshape?  You don't say which
'default' chunkshape you are using, but, in your example above, and for
your kind of access pattern, a pretty optimal chunkshape would be
{20, 4, 1, 3}, because you only need to read one element of the third
dimension on each access, avoiding further unnecessary
reads/decompressions.  However, having a chunksize in the third dimension
moderately larger than 1 could represent a good I/O balance.  See below.

> With the switch to PyTables (from h5import) I was able to easily
> reorganize the data to be more efficient for how I was reading it,
> i.e.,
>
> >>> earrayObject.shape
> (15678L, 4L, 1020L, 3L)
> >>> data = earrayObject[offset,:,:,:]

In PyTables 2.0 you could also set the third dimension as the main one,
and the chunkshape will be computed optimally (I mean, for sparse access
along the main dim and reasonably fast appends).
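For instance, something along these lines (just a quick sketch; the file
and node names are invented):

import tables

fileh = tables.openFile('data.h5', 'w')

# The 0 marks the third axis as the extensible (main) dimension, so the
# automatic chunkshape is computed for sparse access along it.
ea_auto = fileh.createEArray(fileh.root, 'auto', tables.Float32Atom(),
                             (1020, 4, 0, 3), expectedrows=15678)

# ...or force the chunkshape by hand, e.g. the {20, 4, 1, 3} mentioned
# above for your read pattern.
ea_manual = fileh.createEArray(fileh.root, 'manual', tables.Float32Atom(),
                               (1020, 4, 0, 3), expectedrows=15678,
                               chunkshape=(20, 4, 1, 3))
print ea_auto.chunkshape, ea_manual.chunkshape
fileh.close()

In the automatic case, expectedrows is what drives the computed
chunkshape, so try to pass a realistic figure there.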
> It seems to me then, that chunkshape could be selected to also give
> optimal, or near-optimal performance. My problem now is that as I
> make the chunks smaller, I get better read performance (which is the
> goal), but write performance (not done very often) has slowed way
> down. I suppose this makes sense, as smaller chunks imply more
> trips to the disk for I/O writing the entire array.

That's correct.

> So are there any guidelines to balance reading vs writing performance
> with chunkshape? Right now I'm just trying 'sensible' chunkshapes and
> seeing what the result is. Currently, I'm leaning toward something
> like (32, 4, 256, 3). The truth is, only one row is ever read at a
> time, but the write time for (1, 4, 512, 3) is just too long. Is
> there an obvious flaw in my approach that I cannot see?

Not so obvious, because an optimal chunkshape depends largely on your
access pattern and on whether you want to optimize reads, writes or get a
fair balance between them.  So, your mileage may vary.  As a tip, it is
always good to write a small benchmark and see the best parameters for
your case (I know that this takes time, and if you were to write this in
plain C, perhaps you would think twice about doing it, but hey, you are
using Python! ;).

As an example, I've made such a benchmark that times read/write
operations in a scenario similar to yours (see attached script).  The
benchmark selects a chunksize of 1 (labeled as 'e1'), 5 ('e5') and 10
('e10') for the main dimension and measures the times for doing a
sequential write and random sparse reads (along the main dimension).
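In outline, it does something like the following (a simplified sketch of
the attached prova.py; the dimensions, row count and file names here are
made up, so the real script may differ):

import time
import numpy
import tables

N = 1000         # rows to append along the main dimension (made-up figure)
NREADS = 1000    # number of random sparse reads
filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)

for csize in (1, 5, 10):
    fname = 'bench-e%d.h5' % csize
    fileh = tables.openFile(fname, 'w')
    earray = fileh.createEArray(fileh.root, 'earray', tables.Float32Atom(),
                                (0, 4, 1020, 3), filters=filters,
                                expectedrows=N,
                                chunkshape=(csize, 4, 1020, 3))
    row = numpy.random.rand(1, 4, 1020, 3).astype('float32')
    t0 = time.time()
    for i in xrange(N):
        earray.append(row)          # sequential write, one row at a time
    fileh.close()
    print "e%d. Time took for writing: %.3f" % (csize, time.time() - t0)

    fileh = tables.openFile(fname, 'r')
    earray = fileh.root.earray
    indices = numpy.random.randint(0, N, NREADS)
    t0 = time.time()
    for i in indices:
        data = earray[i, :, :, :]   # random sparse read along the main dim
    print "e%d. Time took for %d reads: %.3f" % (csize, NREADS,
                                                 time.time() - t0)
    fileh.close()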
Here are the results when using the zlib (and shuffle) compressor:

************** Writes ************
e1. Time took for writing: 7.567
e5. Time took for writing: 2.361
e10. Time took for writing: 1.873
************** Reads *************
e1. Time took for 1000 reads: 0.588
e5. Time took for 1000 reads: 0.669
e10. Time took for 1000 reads: 0.755

So, using a chunksize of 1 in the maindim is optimal for random reads (as
expected), but it takes a lot of time for writes.  A size of 10 offers
the best writing times but poor read times.  In this case, 5 seems to
represent a reasonably good balance between writes and reads.

If you want better speed but still want to keep using compression, the
LZO compressor performs very well in this scenario.  Here are the times
for LZO (and shuffle):

************** Writes ************
e1. Time took for writing: 4.847
e5. Time took for writing: 1.602
e10. Time took for writing: 1.281
************** Reads *************
e1. Time took for 1000 reads: 0.532
e5. Time took for 1000 reads: 0.568
e10. Time took for 1000 reads: 0.611

which represents up to a 50% speed-up for writes and up to 18% faster
sparse reads.  Finally, removing compression completely might seem the
best bet for optimizing reads, but this can get tricky (and it actually
does get tricky).  The times when disabling compression are:

************** Writes ************
e1. Time took for writing: 4.337
e5. Time took for writing: 1.428
e10. Time took for writing: 1.076
************** Reads *************
e1. Time took for 1000 reads: 0.751
e5. Time took for 1000 reads: 2.979
e10. Time took for 1000 reads: 0.605

i.e. for writes there is a clear win, but reads generally perform slower
(especially for chunksize 5, which is extremely slow, but I don't know
exactly why).

> Also, should I avoid ptrepack, or is there a switch that will
> preserve my carefully chosen chunkshapes? I have the same situation
> as Gabriel in that I don't know what the final number of rows my
> EArray will have (it's now the third dimension that is the
> extensible axis) and I just take the default, expectedrows=1000.

Well, if you want to preserve your carefully tuned chunkshape, then you
shouldn't use ptrepack, as it is meant to re-calculate the chunkshape in
order to adapt it to general use, which may not coincide with your
specific needs (as is generally the case when you are after extremely
fine-tuned chunkshape parameters).

Mmm, I'm thinking that perhaps adding a 'chunkshape' argument to
Leaf.copy() would be a good thing for those users who want to explicitly
set their own chunkshape on the destination leaf.  I'll add a ticket so
that we don't forget about this.  (In the meantime, see the P.S. below
for a manual workaround.)

Hope that helps,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
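P.S. Until Leaf.copy() grows that 'chunkshape' argument, a manual copy
along these lines should preserve a tuned chunkshape (an untested sketch:
it assumes the first axis is the extensible one, and the file and node
names are invented):

import tables

fin = tables.openFile('old.h5', 'r')
fout = tables.openFile('new.h5', 'w')
src = fin.root.earray

# Re-create the leaf with the same atom, filters and, crucially, the same
# chunkshape as the source.
dst = fout.createEArray(fout.root, 'earray', src.atom, (0,) + src.shape[1:],
                        filters=src.filters, expectedrows=src.nrows,
                        chunkshape=src.chunkshape)

# Copy the data in blocks along the main dimension.
step = src.chunkshape[0] * 100
for start in xrange(0, src.nrows, step):
    dst.append(src[start:start + step])

fout.close()
fin.close()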
[Attachment: prova.py (application/python)]