Hello again,
This topic is of great interest to me as I have been attempting to tune
the chunkshape parameter manually.
After our last exchange, I took your suggestions and made all my index
searches in-memory to get max speed. What I found was initially very
surprising, but on reflection started to make sense: I actually had a
greater bottleneck due to how I organized my data vs. how it was being
used. To wit, I had a multidimensional array with a shape like this:
{1020, 4, 15678, 3}
but I was reading it -- with PyTables -- like so:
>>> data = earrayObject[:,:,offset,:]
With small arrays like {20, 4, 15678, 3} it is not so noticeable, but with
the combination of large arrays and the default chunkshape, a lot of time
was being spent slicing the array.
With the switch to PyTables (from h5import), I was able to easily reorganize
the data to be more efficient for how I was reading it, i.e.,
>>> earrayObject.shape
(15678L, 4L, 1020L, 3L)
>>> data = earrayObject[offset,:,:,:]
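As a rough illustration of why the reorganized layout reads faster, the sketch below counts how many separate contiguous runs of elements one such read needs. It ignores chunking entirely and assumes plain row-major (C order) storage, which is a simplification; the helper name contiguous_runs is hypothetical and not part of PyTables.

```python
from math import prod

def contiguous_runs(shape, fixed_axis):
    """Return (number_of_runs, run_length) for a read that fixes one index
    on fixed_axis and takes every other axis in full, assuming the array
    is stored contiguously in row-major (C) order."""
    # Axes after the fixed one are adjacent in storage and form one run.
    run_length = prod(shape[fixed_axis + 1:])
    # Every index combination on the axes before it starts a new run.
    number_of_runs = prod(shape[:fixed_axis])
    return number_of_runs, run_length

# Original layout, read as earrayObject[:, :, offset, :]
old_runs = contiguous_runs((1020, 4, 15678, 3), fixed_axis=2)
# Reorganized layout, read as earrayObject[offset, :, :, :]
new_runs = contiguous_runs((15678, 4, 1020, 3), fixed_axis=0)

print("old layout: %d runs of %d elements" % old_runs)
print("new layout: %d runs of %d elements" % new_runs)
```

Under these assumptions the old layout needs 4080 short runs of 3 elements each, while the new layout needs a single run of 12240 elements.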
It seems to me then, that chunkshape could be selected to also give
optimal, or near-optimal performance. My problem now is that as I make the
chunks smaller, I get better read performance (which is the goal), but
write performance (writing is not done very often) has slowed way down. I
suppose this makes sense, as smaller chunks imply more trips to the disk
when writing the entire array.
So are there any guidelines to balance reading vs writing performance with
chunkshape? Right now I'm just trying 'sensible' chunkshapes and seeing
what the result is. Currently, I'm leaning toward something like (32, 4,
256, 3). The truth is, only one row is ever read at a time, but the write
time for (1, 4, 512, 3) is just too long. Is there an obvious flaw in my
approach that I cannot see?
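To put numbers on the trade-off between the two candidate chunkshapes, here is a minimal sketch that only counts chunks. It assumes the reorganized shape (15678, 4, 1020, 3), that a read always pulls whole chunks from disk, and that the total chunk count is a fair proxy for the cost of writing the whole array. The helper name chunk_stats is hypothetical, and none of this is what PyTables or HDF5 computes internally.

```python
from math import ceil, prod

def chunk_stats(shape, chunkshape, row_axis=0):
    """Count chunks for a dataset that is read one index at a time
    along row_axis, with every other axis taken in full."""
    chunks_per_axis = [ceil(n / c) for n, c in zip(shape, chunkshape)]
    total_chunks = prod(chunks_per_axis)
    # A single-index read crosses exactly one chunk along row_axis
    # and all chunks along the remaining axes.
    chunks_per_read = total_chunks // chunks_per_axis[row_axis]
    elements_read = chunks_per_read * prod(chunkshape)
    elements_wanted = prod(shape) // shape[row_axis]
    return {
        "total_chunks": total_chunks,
        "chunks_per_read": chunks_per_read,
        "elements_read": elements_read,
        "elements_wanted": elements_wanted,
    }

shape = (15678, 4, 1020, 3)
for chunkshape in [(32, 4, 256, 3), (1, 4, 512, 3)]:
    stats = chunk_stats(shape, chunkshape)
    print(chunkshape, stats)
```

With these assumptions, (32, 4, 256, 3) gives 1960 chunks in total but reads 393216 elements to return 12240, whereas (1, 4, 512, 3) reads only 12288 elements per row but splits the array into 31356 chunks.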
Also, should I avoid ptrepack, or is there a switch that will preserve my
carefully chosen chunkshapes? I have the same situation as Gabriel in that
I don't know how many rows my EArray will finally have (it is now the
third dimension that is the extensible axis), and I just take the
default, expectedrows=1000.
With gratitude,
Elias Collas
Stress Methods Group
Gulfstream Aerospace Corp.
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users