Hi again,

Thanks to Ger and George for their replies. It turns out that my memory problems were caused by not closing a dataspace: I was creating a new dataspace for each hyperslab but never closing the old one, hence the massive memory use! I fixed that, but the program was still slow, so I took the advice and switched to reading [1,2200] hyperslabs, and that helped significantly. Playing about with the chunk sizes also helped, so now I have a nice, fast program for loading all the data.
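For the archives, the corrected read loop boils down to roughly the following sketch (the file name "observation.h5", the dataset name "/data", the float type and the 2200-column row width are just illustrative placeholders, not my actual data):

/* Sketch: read a 2-D dataset one [1,2200] row hyperslab at a time,
 * closing the file dataspace at the end of every iteration so the
 * handles do not accumulate. Names and sizes are illustrative only. */
#include "hdf5.h"

#define NCOLS 2200   /* assumed row length */

int main(void)
{
    hid_t file = H5Fopen("observation.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/data", H5P_DEFAULT);

    /* Memory dataspace for a single row: created once and reused. */
    hsize_t mdims[2] = {1, NCOLS};
    hid_t memspace = H5Screate_simple(2, mdims, NULL);

    /* Find out how many rows the dataset has. */
    hsize_t dims[2];
    hid_t fspace0 = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace0, dims, NULL);
    H5Sclose(fspace0);

    float row[NCOLS];

    for (hsize_t r = 0; r < dims[0]; r++) {
        /* Fresh file dataspace for this row's selection... */
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[2] = {r, 0};
        hsize_t count[2] = {1, NCOLS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        H5Dread(dset, H5T_NATIVE_FLOAT, memspace, fspace, H5P_DEFAULT, row);

        /* ...and close it every iteration; forgetting this was the
         * source of the runaway memory use described above. */
        H5Sclose(fspace);

        /* process row[] here */
    }

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}

The other part was making the dataset chunking line up with these row reads: with chunks of roughly one row each, every H5Dread touches whole chunks rather than slices of many, which I assume is why playing with the chunk sizes made such a difference.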
Thanks again, you've been a great help - and saved me from a lot of problems in getting this working nicely.

Simon.

>>> "Ger van Diepen" <[email protected]> 1/28/2011 2:55 pm >>>

We used mmap-ed I/O in that application, but that didn't help; the physical disk seeks still have to be done. I must say that we did this on a single RAID array. On a large disk subsystem the data might be spread over many more disks, and leapfrogging through the data might be less painful.

Cheers,
Ger

>>> "George N. White III" <[email protected]> 1/28/2011 2:47 PM >>>

On Thu, Jan 27, 2011 at 4:43 PM, Simon R. Proud <[email protected]> wrote:
> Thanks for the reply!
>
>> Are you seeing a lot of disk activity after the data have been loaded
>> into memory? That would indicate excessive swapping. Low CPU usage
>> (CPU is waiting on I/O) is another indicator. There are usually some
>> OS-specific tools to gather statistics on vm usage and swapping. Are
>> the data on a local disk or a network server?
>
> The entire thing is being run on a cluster, so I can't check disk activity -
> but the data is local to the program.
> However, I can see that the program is fast at loading the first 60 or so
> files and then slows down. As soon as that slowdown occurs I also see
> virtual memory usage increase, so I assume it's loading data into VM rather
> than physical RAM.
>
>> You need to tell us more about how the data are used. One common
>> example is where the calculation is repeated for each (i,j) coord. across
>> all 100+ files, so there is no need to store complete arrays, but you want
>> parts of all arrays to be stored at the same time. Another is a
>> calculation that uses data from one array at a time, so there is no
>> need to store more than one array at a time.
>
> Yes, I'm performing the former - processing each (i,j) element individually.
> It is remote sensing data, with each file being a separate observation, so
> what I'm doing is processing a time series on a per-pixel basis.
> As you say, there's no need to store the complete arrays, but my attempts at
> loading only a small hyperslab (corresponding to one row of the input
> images) have not been successful.
>
> Hope that makes sense, and thanks again.
> Simon.

Ger van Diepen's suggestions make sense to me. I know that some other sites offering time-series views of RS data create a separate copy of the data organized as he suggests. What I don't know is whether it is still possible, on a modern cluster and using HDF5, to take advantage of memory-mapped I/O for this use case. Real life is more complicated, as we want to do this with "binned" (integerized sinusoidal grid) data, so we don't have regular arrays.

--
George N. White III <[email protected]>
Head of St. Margarets Bay, Nova Scotia

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
