Your data arrays have shape [2000,2200], and I understand you read [2000,1] hyperslabs. In HDF5 the indices are in C order, thus the first axis varies slowest. I think it would be much better to read [1,2200] hyperslabs, since a full row is contiguous on disk whereas a [2000,1] column selection touches every row.
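For illustration, a minimal sketch of a row read with the HDF5 C API. The file name "obs.h5", the dataset name "data", the float element type and the row index are placeholders for your own values, and error checking is omitted:

    /* Minimal sketch: read one full row ([1,2200] hyperslab). */
    #include "hdf5.h"

    #define NCOLS 2200

    int main(void)
    {
        hid_t file = H5Fopen("obs.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);

        /* Select row 42: start [42,0], count [1,NCOLS].  In C order this
         * selection is contiguous on disk. */
        hsize_t start[2] = {42, 0};
        hsize_t count[2] = {1, NCOLS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        /* Matching 1-D memory dataspace, so the row lands in a flat buffer. */
        hsize_t mdims[1] = {NCOLS};
        hid_t mspace = H5Screate_simple(1, mdims, NULL);

        float row[NCOLS];
        H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, row);

        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }

The element counts of the memory and file selections match (2200 each), which is all H5Dread requires; the ranks may differ.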
Another approach might be to store the data in a single tiled cube where time is an extendible axis.

As an aside: we have (radio-astronomical) datasets of tens of GBytes ordered in time,baseline,freq, while a particular application needs to access the data in baseline,time,freq order. Even though the amount of data to read per baseline,time is only about 8 KBytes, the seek times on the disk were killing (even when reading multiple baselines at a time). It turned out that first resorting the data was cheaper than leapfrogging through them. Eventually the algorithm was changed so that as many time slots as fit in physical memory are read at once, making the data access sequential again (a rough sketch in HDF5 terms follows below the quoted message). Although HDF5 was not used in that case, I think the same holds for HDF5.

Cheers,
Ger

>>> "Simon R. Proud" <[email protected]> 1/27/2011 9:43 PM >>>
Thanks for the reply!

> Are you seeing a lot of disk activity after the data have been loaded
> into memory? That would indicate excessive swapping. Low CPU usage
> (CPU is waiting on I/O) is another indicator. There are usually some
> OS-specific tools to gather statistics on vm usage and swapping. Are
> the data on a local disk or a network server?

The entire thing is being run on a cluster, so I can't check disk activity, but the data are local to the program. However, I can see that the program is fast at loading the first 60 or so files and then slows down. As soon as that slowdown occurs I also see virtual memory usage increase, so I assume it is loading data into VM rather than into physical RAM.

> You need to tell us more about how the data are used. One common
> example is where the calculation is repeated for each (i,j) coord. of
> all 100+ files, so there is no need to store complete arrays, but you
> want parts of all arrays to be stored at the same time. Another is a
> calculation that uses data from one array at a time, so there is no
> need to store more than one array at a time.

Yes, I'm doing the former: processing each (i,j) element individually. It is remote sensing data, with each file being a separate observation, so what I'm doing is processing a time series on a per-pixel basis. As you say, there is no need to store the complete arrays, but my attempts at loading only a small hyperslab (corresponding to one row of the input images) have not been successful.

Hope that makes sense, and thanks again.
Simon.
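P.S. A rough sketch of the blocked scheme described above, translated to HDF5: read a band of rows from every file (each read is contiguous on disk), then process the complete per-pixel time series inside the band before moving on. The file naming pattern, dataset name, dimensions, float type and block size are all placeholders; NROWS is assumed to be a multiple of BLOCK, and error checking is omitted.

    #include <stdio.h>
    #include "hdf5.h"

    #define NFILES 100   /* number of observation files */
    #define NROWS  2000
    #define NCOLS  2200
    #define BLOCK  50    /* rows per band; NFILES*BLOCK*NCOLS floats must fit in RAM */

    /* Placeholder per-pixel processing: here just a mean over time. */
    static float process_pixel(const float *series, int n)
    {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += series[k];
        return sum / n;
    }

    /* One band of rows from every file: band[f][i][j] is pixel
     * (row0+i, j) of file f.  100*50*2200 floats is about 44 MB. */
    static float band[NFILES][BLOCK][NCOLS];

    int main(void)
    {
        for (hsize_t row0 = 0; row0 < NROWS; row0 += BLOCK) {
            /* Sequential read of BLOCK contiguous rows from each file. */
            for (int f = 0; f < NFILES; f++) {
                char name[64];
                snprintf(name, sizeof name, "obs_%03d.h5", f); /* placeholder names */
                hid_t file = H5Fopen(name, H5F_ACC_RDONLY, H5P_DEFAULT);
                hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);
                hid_t fspace = H5Dget_space(dset);
                hsize_t start[2] = {row0, 0}, count[2] = {BLOCK, NCOLS};
                H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
                hid_t mspace = H5Screate_simple(2, count, NULL);
                H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, band[f]);
                H5Sclose(mspace);
                H5Sclose(fspace);
                H5Dclose(dset);
                H5Fclose(file);
            }
            /* Every pixel in the band now has its full time series in memory. */
            for (int i = 0; i < BLOCK; i++) {
                for (int j = 0; j < NCOLS; j++) {
                    float series[NFILES];
                    for (int f = 0; f < NFILES; f++)
                        series[f] = band[f][i][j];
                    process_pixel(series, NFILES);
                }
            }
        }
        return 0;
    }

Opening and closing each file once per band keeps the number of open handles small; if the open/close overhead matters, the file and dataset handles could be kept open across bands instead.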
