Hi again,

Thanks to Ger and George for their replies. It turns out that my memory problems were caused by not closing a dataspace: I was creating a new dataspace for each hyperslab but never closing the old one, hence the massive memory use! I fixed that, but the program was still slow, so I took the advice and switched to reading [1,2200] hyperslabs, and that helped significantly. Playing about with the chunk sizes also helped, so now I have a nice, fast program for loading all the data.
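For the archives, the corrected read loop boils down to roughly the following sketch (the file name "observation.h5", the dataset name "/data", the float type and the 2200-column row width are just illustrative placeholders, not my actual data):

/* Sketch: read a 2-D dataset one [1,2200] row hyperslab at a time,
 * closing the file dataspace at the end of every iteration so the
 * handles do not accumulate. Names and sizes are illustrative only. */
#include "hdf5.h"

#define NCOLS 2200   /* assumed row length */

int main(void)
{
    hid_t file = H5Fopen("observation.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/data", H5P_DEFAULT);

    /* Memory dataspace for a single row: created once and reused. */
    hsize_t mdims[2] = {1, NCOLS};
    hid_t memspace = H5Screate_simple(2, mdims, NULL);

    /* Find out how many rows the dataset has. */
    hsize_t dims[2];
    hid_t fspace0 = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace0, dims, NULL);
    H5Sclose(fspace0);

    float row[NCOLS];

    for (hsize_t r = 0; r < dims[0]; r++) {
        /* Fresh file dataspace for this row's selection... */
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[2] = {r, 0};
        hsize_t count[2] = {1, NCOLS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        H5Dread(dset, H5T_NATIVE_FLOAT, memspace, fspace, H5P_DEFAULT, row);

        /* ...and close it every iteration; forgetting this was the
         * source of the runaway memory use described above. */
        H5Sclose(fspace);

        /* process row[] here */
    }

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}

The other part was making the dataset chunking line up with these row reads: with chunks of roughly one row each, every H5Dread touches whole chunks rather than slices of many, which I assume is why playing with the chunk sizes made such a difference.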
Thanks again, you've been a great help - and saved me from a lot of problems in getting this working nicely.

Simon.

>>> "Ger van Diepen" <[email protected]> 1/28/2011 2:55 pm >>>

We used mmap-ed I/O in that application, but that didn't help; the physical disk seeks still have to be done. I must say that we did this on a single RAID array. On a large disk subsystem the data might be spread over many more disks, and leapfrogging through the data might be less painful.

Cheers,
Ger

>>> "George N. White III" <[email protected]> 1/28/2011 2:47 PM >>>

On Thu, Jan 27, 2011 at 4:43 PM, Simon R. Proud <[email protected]> wrote:
> Thanks for the reply!
>
>> Are you seeing a lot of disk activity after the data have been loaded
>> into memory? That would indicate excessive swapping. Low CPU usage
>> (CPU is waiting on I/O) is another indicator. There are usually some
>> OS-specific tools to gather statistics on vm usage and swapping. Are
>> the data on a local disk or a network server?
>
> The entire thing is being run on a cluster, so I can't check disk activity -
> but the data is local to the program.
> However, I can see that the program is fast at loading the first 60 or so
> files and then slows down. As soon as that slowdown occurs I also see
> virtual memory usage increase, so I assume it's loading data into VM rather
> than physical RAM.
>
>> You need to tell us more about how the data are used. One common
>> example is where the calculation is repeated for each (i,j) coord. across
>> all 100+ files, so there is no need to store complete arrays, but you want
>> parts of all arrays to be stored at the same time. Another is a
>> calculation that uses data from one array at a time, so there is no
>> need to store more than one array at a time.
>
> Yes, I'm performing the former - processing each (i,j) element individually.
> It is remote sensing data, with each file being a separate observation, so
> what I'm doing is processing a time series on a per-pixel basis.
> As you say, there's no need to store the complete arrays, but my attempts at
> loading only a small hyperslab (corresponding to one row of the input
> images) have not been successful.
>
> Hope that makes sense, and thanks again.
> Simon.

Ger van Diepen's suggestions make sense to me. I know that some other sites offering time-series views of RS data create a separate copy of the data organized as he suggests. What I don't know is whether it is still possible, on a modern cluster and using HDF5, to take advantage of memory-mapped I/O for this use case. Real life is more complicated, as we want to do this with "binned" (integerized sinusoidal grid) data, so we don't have regular arrays.

--
George N. White III <[email protected]>
Head of St. Margarets Bay, Nova Scotia

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
