Your data arrays have shape [2000,2200] and I understand you read
[2000,1] hyperslabs.
In HDF5 the indices are in C order, thus the first axis varies slowest.
I think it would be much better to read [1,2200] hyperslabs: a full row
is contiguous on disk, whereas a [2000,1] column means 2000 scattered
reads.
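To make that concrete, a minimal sketch with the C API (the dataset
name "data" and the float type are guesses on my part; error checking
omitted):

#include "hdf5.h"

/* Read one row of a [2000,2200] dataset as a [1,2200] hyperslab. */
int read_row(const char *fname, hsize_t row, float *buf /* 2200 floats */)
{
    hid_t file   = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "data", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t start[2] = { row, 0 };   /* first index varies slowest (C order) */
    hsize_t count[2] = { 1, 2200 };  /* one full row: contiguous on disk */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(2, count, NULL);
    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}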

Another approach might be to store the data in a single tiled (chunked)
cube where time is an extendible axis.
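For what it's worth, creating such a cube could look roughly like this
(chunk shape, names and type are placeholders, not your schema):

#include "hdf5.h"

/* A chunked 3-D dataset [time, y, x] with an unlimited time axis,
 * so each new observation can be appended as one slice. */
hid_t create_cube(hid_t file)
{
    hsize_t dims[3]    = { 0, 2000, 2200 };             /* start empty in time */
    hsize_t maxdims[3] = { H5S_UNLIMITED, 2000, 2200 }; /* time is extendible */
    hsize_t chunk[3]   = { 1, 100, 2200 };              /* one slice, 100 rows */

    hid_t space = H5Screate_simple(3, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);   /* chunking is required for unlimited dims */

    hid_t dset = H5Dcreate2(file, "cube", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Before writing slice t, grow the time axis:
       hsize_t newdims[3] = { t + 1, 2000, 2200 };
       H5Dset_extent(dset, newdims);                                          */
    H5Pclose(dcpl); H5Sclose(space);
    return dset;
}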
 
As an aside. 
We have (radio-astronomical) datasets of tens of GBytes ordered as
(time, baseline, freq), while a particular application needs to access
the data in (baseline, time, freq) order. Even though the amount of data
to read per (baseline, time) is about 8 KBytes, the seek times on the
disk were killing us (even when reading multiple baselines at a time).
It turned out that first re-sorting the data was cheaper than
leapfrogging through them.
Eventually the algorithm was changed such that as many time slots as fit
in physical memory are read at once, so the data access could be made
sequential again.
Although HDF5 was not used in this case, I think the same will hold for
HDF5.
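Roughly, the blocked pattern was the following (a sketch only; the
sizes and the read_time_slots() helper are hypothetical):

#include <stdlib.h>

extern void read_time_slots(size_t first, size_t n, char *dst); /* hypothetical I/O */

void process_blocked(size_t n_time, size_t n_baseline,
                     size_t slot_bytes, size_t mem_budget)
{
    size_t per_block = mem_budget / slot_bytes;   /* whole slots per read */
    char  *block = malloc(per_block * slot_bytes);

    for (size_t t0 = 0; t0 < n_time; t0 += per_block) {
        size_t n = (n_time - t0 < per_block) ? n_time - t0 : per_block;
        read_time_slots(t0, n, block);            /* one sequential read */
        for (size_t b = 0; b < n_baseline; b++) {
            /* process baseline b across the n in-memory time slots */
        }
    }
    free(block);
}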

Cheers, 
Ger 

>>> "Simon R. Proud" <[email protected]> 1/27/2011 9:43 PM >>>
Thanks for the reply!

>Are you seeing a lot of disk activity after the data have been loaded
>into memory? That would indicate
>excessive swapping. Low CPU usage (CPU is waiting on I/O) is another
>indicator. There are usually some OS-specific tools to gather
>statistics on vm usage and swapping. Are the data on a local disk or
>a network server?

The entire thing is run on a cluster, so I can't check disk activity,
but the data are local to the program.
However, I can see that the program is fast at loading the first 60 or
so files and then slows down. As soon as that slowdown occurs I also see
virtual memory usage increase, so I assume it is loading data into VM
rather than physical RAM.

>You need to tell us more about how the data are used. One common
>example is where the calculation is repeated for each (i,j) coord.
>across all 100+ files, so there is no need to store complete arrays,
>but you want parts of all arrays to be stored at the same time.
>Another is a calculation that uses data from one array at a time, so
>there is no need to store more than one array at a time.

Yes, I'm doing the former: processing each (i,j) element individually.
It is remote-sensing data, with each file being a separate observation,
so I'm processing a timeseries on a per-pixel basis.
As you say, there's no need to store the complete arrays, but my
attempts at loading only a small hyperslab (corresponding to one row of
the input images) have not been successful.
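
For reference, what I'm trying to do is roughly this (a sketch;
read_row() stands for whatever routine reads a [1,2200] hyperslab from
one file):

#include <stdlib.h>
#include "hdf5.h"

/* Hypothetical helper: read row `row` of the [2000,2200] dataset in
   `fname` as a [1,2200] hyperslab into buf (2200 floats). */
extern int read_row(const char *fname, hsize_t row, float *buf);

void process_by_rows(char **fnames, size_t n_files)
{
    /* one row from every file: the timeseries of a whole image row */
    float *slice = malloc(n_files * 2200 * sizeof *slice);

    for (hsize_t row = 0; row < 2000; row++) {
        for (size_t f = 0; f < n_files; f++)
            read_row(fnames[f], row, slice + f * 2200);

        for (size_t x = 0; x < 2200; x++) {
            /* pixel (row, x): its timeseries is slice[f*2200 + x]
               for f = 0 .. n_files-1; process it here */
        }
    }
    free(slice);
}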

Hope that makes sense, and thanks again.
Simon.