Hi Mark,

>> Sorry if my response is off target.

Not at all off target; it's right on target. [Like most subscribers to this 
list located outside of Champaign, IL, we have never met in person, but I have 
been reading your comments for, what, 10 years now, and they are always 
on target.]


>> If I read your document correctly, it proposes a couple of solutions.


Correct: two solutions for a problem that was basically a "bad" use of HDF5.

The HistoTool software saves data in real time from neutron experiments. The 
NeXus format (based on HDF5) is used:

http://www.nexusformat.org/Main_Page


One of the things HistoTool does is append data as described in the PDF. 
However, due to the way the NeXus API was used, performance was very slow: 
several orders of magnitude slower than using a plain binary file to save the 
experiment results. So one question that came up was:

"Why should I use HDF5 instead of a binary file, if it's several orders of 
magnitude slower?"
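
(For reference, the append pattern in question, written against the raw HDF5 C 
API, looks roughly like this. This is a minimal sketch with made-up names, not 
the actual HistoTool code; it assumes a 1D chunked dataset with unlimited 
maximum size, as sketched further below.)

// Sketch only (hypothetical names): extend a 1D unlimited dataset by
// nelems and write the new elements at its tail. When each append is
// small and straddles a chunk boundary, the library must read back and
// rewrite the affected chunks -- hence the slowdown.
#include <hdf5.h>

void append_1d(hid_t dset, const int *buf, hsize_t nelems)
{
    // query the current extent
    hid_t fspace = H5Dget_space(dset);
    hsize_t cur = 0;
    H5Sget_simple_extent_dims(fspace, &cur, NULL);
    H5Sclose(fspace);

    // grow the dataset and select only the newly added tail
    hsize_t newdim = cur + nelems;
    H5Dset_extent(dset, &newdim);
    fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &cur, NULL, &nelems, NULL);

    hid_t mspace = H5Screate_simple(1, &nelems, NULL);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, buf);
    H5Sclose(mspace);
    H5Sclose(fspace);
}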

So, I implemented the two solutions explained in the document.


This API can be used as a "use case" of how to circumvent the "problems" of 
the chunk design (by "problems" I mean the fact that, using HDF5 as in the 
original software implementation, performance was several orders of magnitude 
slower than with a plain binary file; this has to do with the way chunks are 
designed).
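
(To make "the way chunks are designed" concrete: an extendible dataset must 
use chunked layout, declared at creation time, and the chunk is the unit in 
which the library moves data to and from the file. A minimal sketch, with 
names of my choosing:)

// Declare a chunked, extendible 1D dataset of ints (illustration only).
// The chunk size fixed here is the granularity of all file I/O on the
// dataset, which is why misaligned appends are costly.
#include <hdf5.h>

hid_t create_extendible_1d(hid_t file, const char *name, hsize_t chunk_elems)
{
    hsize_t dims    = 0;              // start empty
    hsize_t maxdims = H5S_UNLIMITED;  // allow the dataset to grow
    hid_t space = H5Screate_simple(1, &dims, &maxdims);

    // chunked layout is mandatory for extendible datasets
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk_elems);

    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}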

>>One is to add your own temporary storage layer to act as a sort of
>>impedance matcher thereby enforcing that any actual I/O requests to HDF5
>>match on cache block boundaries.

Right. In this case the user had to modify the software just to avoid the 
"problem" mentioned. That should not have to happen; a user should not have 
to modify his code to work around it.
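
(Here is a minimal sketch of that kind of staging layer, assuming a 1D 
dataset of ints; the class and member names are mine, not the actual code. 
Records accumulate in memory and are handed to HDF5 only in whole-chunk 
units, so every write lands exactly on a chunk boundary and no chunk is 
ever read back:)

#include <hdf5.h>
#include <cstddef>
#include <vector>

// Hypothetical "impedance matcher": stage appends in memory and flush
// whole chunks only, so H5Dwrite requests always align with chunk
// boundaries and read-modify-write is avoided.
class ChunkAlignedWriter {
public:
    ChunkAlignedWriter(hid_t dset, hsize_t chunk_elems)
        : m_dset(dset), m_chunk(chunk_elems), m_written(0) {}

    void append(const int *data, size_t n)
    {
        m_buf.insert(m_buf.end(), data, data + n);
        while (m_buf.size() >= m_chunk)  // flush only complete chunks
            flush_chunk();
        // a final partial chunk would be flushed once, at close time
    }

private:
    void flush_chunk()
    {
        // grow the dataset by exactly one chunk and write it in one call
        hsize_t newdim = m_written + m_chunk;
        H5Dset_extent(m_dset, &newdim);
        hid_t fspace = H5Dget_space(m_dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &m_written,
                            NULL, &m_chunk, NULL);
        hid_t mspace = H5Screate_simple(1, &m_chunk, NULL);
        H5Dwrite(m_dset, H5T_NATIVE_INT, mspace, fspace,
                 H5P_DEFAULT, m_buf.data());
        H5Sclose(mspace);
        H5Sclose(fspace);
        m_buf.erase(m_buf.begin(),
                    m_buf.begin() + static_cast<std::ptrdiff_t>(m_chunk));
        m_written = newdim;
    }

    hid_t m_dset;
    hsize_t m_chunk;
    hsize_t m_written;
    std::vector<int> m_buf;  // staging buffer between the app and HDF5
};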


>>The other solution involves a 'new API' I think. Is this a proposed new
>>API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces?

It is a new API. At this moment there are no plans for it to be incorporated 
into HDF5, but with some changes it could certainly be made more general 
purpose. The datasets are all 1D and I make extensive use of the STL 
(vectors), but this could be changed for general use (e.g., in the 
input/output data formats).

>> Is the main issue that you need to be able to adjust chunk cache size based 
>> on your application's needs at the time the data is being read or written?

Yes. It allows the chunk cache size to be passed as a parameter (currently 
8 GB); on top of this, it keeps track of a multitude of *open* datasets in an 
STL map (pairs of path and HDF5 dataset ID); the purpose of this is to avoid 
closing and opening datasets frequently (that is, as little as possible).
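
(A rough sketch of those two mechanisms, with names I made up for 
illustration: the cache size goes through a dataset access property list via 
H5Pset_chunk_cache, and open dataset IDs are kept in an STL map keyed by 
path:)

#include <hdf5.h>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical cache of open datasets, keyed by path, each opened with a
// caller-supplied chunk cache size.
class DatasetCache {
public:
    explicit DatasetCache(size_t cache_bytes) : m_cache_bytes(cache_bytes) {}

    // return an open dataset ID, opening each path at most once
    hid_t get(hid_t file, const std::string &path)
    {
        std::map<std::string, hid_t>::iterator it = m_open.find(path);
        if (it != m_open.end())
            return it->second;

        // set the chunk cache size on a dataset access property list;
        // 521 slots is the library default, the byte count is our knob
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 521, m_cache_bytes,
                           H5D_CHUNK_CACHE_W0_DEFAULT);
        hid_t dset = H5Dopen2(file, path.c_str(), dapl);
        H5Pclose(dapl);

        m_open.insert(std::make_pair(path, dset));
        return dset;
    }

    // close everything once, instead of per access
    ~DatasetCache()
    {
        for (std::map<std::string, hid_t>::iterator it = m_open.begin();
             it != m_open.end(); ++it)
            H5Dclose(it->second);
    }

private:
    size_t m_cache_bytes;
    std::map<std::string, hid_t> m_open;  // (path, HDF5 dataset ID)
};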


>> I think the HDF5 library has all the functions you'd need to do that already.

Correct. This API is a use case of that.
I posted it as general information for the community; hopefully it will help 
someone who at some point is faced with a similar problem.

Like I said, comments/questions/suggestions are welcome (like "can I have it 
too?").

Thanks, Mark, for the comments

Feel free to follow up with more questions or suggestions on how to improve it 
for more general use.

Pedro 

  ----- Original Message ----- 
  From: Mark C. Miller 
  To: HDF Users Discussion List 
  Sent: Monday, July 25, 2011 5:35 PM
  Subject: Re: [Hdf-forum] H5 Map - a HDF5 chunk caching API


  Hi Pedro,

  Ok, I am not sure I fully understand what you are proposing or
  requesting but I certainly won't let that stop me from sharing my
  opinions ;)

  It sounds like you are dealing with the fundamental 'read-modify-write'
  performance problems that often occur in caching algorithms where
  operations 'above' the cache span multiple cache blocks.

  If I read your document correctly, it proposes a couple of solutions.
  One is to add your own temporary storage layer to act as a sort of
  impedance matcher thereby enforcing that any actual I/O requests to HDF5
  match on cache block boundaries.

  The other solution involves a 'new API' I think. Is this a proposed new
  API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces? Is the main issue
  that you need to be able to adjust chunk cache size based on your
  application's needs at the time the data is being read or written? If
  so, I think the HDF5 library has all the functions you'd need to do that
  already. So, it's really not clear to me what value added you are
  proposing. If your datasets are 1-dimensional and you are processing
  more or less sequentially through them, as your pictures suggest, then I'd
  think a cache size equal to a few blocks should be sufficient to avoid
  the read-modify-write behavior. If it's 2D, and your access is more or
  less a sliding 2D window, then I'd think a cache size of about 9 blocks
  would be sufficient to avoid read-modify-write behavior. If it's 3D, then
  27 blocks. So, I figure I must be missing something that motivates this
  because beyond manipulating the cache size, I am not seeing what else
  your 'new API' solution provides.

  Sorry if my response is off target.

  Mark



  On Mon, 2011-07-25 at 14:36 -0700, Pedro Silva Vicente wrote:
  >  
  > Dear  HDF community
  >  
  > Please find a document detailing an HDF5 API developed at ORNL regarding
  > chunk caching.
  >  
  > At this moment I would be very happy in receiving comments/questions/
  > suggestions.
  > 
  >  
  > 
  > ----------------------
  > Pedro Vicente
  > [email protected]
  > http://www.space-research.org/


  _______________________________________________
  Hdf-forum is for HDF software users discussion.
  [email protected]
  http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org