If you want to have multiple ranks write to the same file, you’ll need to open 
the file in read-write and use parallel HDF5 with the associated overhead and 
complexity of the collective calls. I think the only way to avoid the overhead 
of the collective calls is to open separate files for each rank.
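
For concreteness, a minimal sketch of the two options in h5py (assuming an h5py build with parallel HDF5 support plus mpi4py; file and dataset names are placeholders):

# Sketch only: contrast a single shared file (collective metadata operations)
# with one independent file per rank (no parallel HDF5 needed).
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Option 1: one shared file via the MPI-IO driver; dataset creation and
# resizing are collective across all ranks.
with h5py.File('shared.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('data', (comm.Get_size(), 100), dtype='f8')
    dset[rank, :] = float(rank)   # each rank writes its own row

# Option 2: one file per rank; everything is local to this process.
with h5py.File(f'rank_{rank}.h5', 'w') as f:
    f.create_dataset('data', data=[float(rank)] * 100)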

If you are going to have a multi-file approach, and read from files which are 
open in write mode by another process, you’ll need to have some way to get the 
metadata updated in the reading processes. It sounds like you might want to try 
another 1.10.x addition, single-writer/multiple-reader (SWMR). If each rank can open its 
own output file in read-write, and all the other ranks’ files in read-only, you 
can avoid the parallel overhead. I haven’t tried this approach, and you’ll have 
to be careful of race conditions and keep the file metadata correct in all the 
ranks, but it sounds like it might fit your parallel I/O model. 
https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html
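
An untested sketch of how that could look in h5py (assuming a build against HDF5 1.10 with SWMR support; file names are placeholders, and the barrier stands in for whatever coordination your framework provides):

# Sketch only: each rank is the single writer of its own file and a
# reader of another rank's file, using SWMR instead of parallel HDF5.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Writer side: this rank owns rank_<rank>.h5 and appends to it.
wf = h5py.File(f'rank_{rank}.h5', 'w', libver='latest')
dset = wf.create_dataset('data', shape=(0,), maxshape=(None,), dtype='f8')
wf.swmr_mode = True            # readers may attach from this point on

dset.resize((10,))             # local operation: no other rank involved
dset[:] = float(rank)
dset.flush()                   # make the new data visible to readers

comm.Barrier()                 # crude guard against the obvious race

# Reader side: open another rank's file read-only in SWMR mode.
other = (rank + 1) % comm.Get_size()
rf = h5py.File(f'rank_{other}.h5', 'r', libver='latest', swmr=True)
rdset = rf['data']
rdset.refresh()                # pick up the writer's metadata updates
print(rank, rdset.shape)

rf.close()
wf.close()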

Jarom

From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Chris Green
Sent: Monday, July 25, 2016 3:41 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Parallel dataset resizing strategies


Hi,

Thanks for this. Comments inline.
On 7/22/16 12:13 PM, Nelson, Jarom wrote:

If you can move to HDF5 1.10, I would recommend independent files for each MPI 
rank, and then create a master file (created independently perhaps by rank 0) 
with Virtual Datasets linking in the data from each rank in the format you 
need. Virtual Datasets can be created with file matching patterns for 
dynamically increasing datasets, so you might look into using that feature.
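
As a rough sketch of what the master file might look like (requires HDF5 1.10; newer h5py releases expose the VDS API; the shapes and file names here are made up):

# Sketch only: rank 0 builds a master file that stitches the per-rank
# datasets together as one virtual dataset.
import h5py

n_ranks, n_cols = 4, 100                   # hypothetical sizes
layout = h5py.VirtualLayout(shape=(n_ranks, n_cols), dtype='f8')

for r in range(n_ranks):
    # each per-rank file is assumed to hold a 1-D dataset named 'data'
    layout[r] = h5py.VirtualSource(f'rank_{r}.h5', 'data', shape=(n_cols,))

with h5py.File('master.h5', 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-1)

# Readers then open master.h5 read-only and see one (n_ranks, n_cols) dataset.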

We don't have existing tools relying on a particular version, so we are 
nominally free to move to HDF5 1.10.x. However, it won't be completely 
straightforward because I have been relying for now on using the homebrew 
version, which is currently 1.8.16. I'd have to dink the recipe to use 1.10.x, 
which is not a showstopper.
I found this approach much faster than creating a collective file (~5-10x 
speedup on a Lustre filesystem). You don’t need to do any collective reads or 
writes, and I think we could even bypass using parallel HDF5 altogether. Note, 
this will only work if you only ever need to open the Virtual Dataset in 
parallel (i.e. by more than one process) as non-collective read-only. If you 
need to have read-write access to the master file, you can’t access a Virtual 
Dataset using collective operations. You can, however, have as many processes 
as you like read from a virtual dataset from a file opened as read-only.

If you have other tools that use your data but can’t move to HDF5 1.10, you can 
h5repack a file with Virtual Datasets to remove the Virtual Datasets, and it 
should be compatible with HDF5 1.8 (use h5repack from HDF5 1.10 patch 1 or 
later). This also worked well for us and I was able to load a repacked file in 
IDL under a 1.8 HDF5 library. However, h5repack is not a parallel application, 
so it can be slow to repack a very large file, on the order of minutes per GB.

After having thought a little more about likely parallel models, I think now we 
can arrange that:

  *   Only one rank will write to a particular dataset.
  *   A dataset will not be read from in the same job in which it was written.
  *   A dataset may be read by one or more ranks.

I *think* if that's the case, we could use a hierarchical multi-file format 
without resorting to virtual datasets, no? I still have some reading and 
experimenting to do, but if you have particular information that would speak to 
the likely success of this approach, I'd be happy to hear it.
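
For what it's worth, here is the kind of layout I have in mind, as a sketch only (hypothetical file/group names; the write and read phases would be separate jobs, per the constraints above):

# Sketch only: hierarchical multi-file layout with exactly one writer
# per dataset, so resizing never needs collective calls.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def write_job():
    # This rank is the sole writer of every dataset in its own file.
    with h5py.File(f'products_rank{rank}.h5', 'w') as f:
        d = f.create_dataset('module_A/hits', shape=(0,),
                             maxshape=(None,), dtype='f8')
        d.resize((50,))        # local: no other rank participates
        d[:] = float(rank)

def read_job():
    # In a later job, any rank may open any per-rank file read-only.
    other = (rank + 1) % comm.Get_size()
    with h5py.File(f'products_rank{other}.h5', 'r') as f:
        print(rank, f['module_A/hits'].shape)

write_job()
comm.Barrier()                 # stand-in for "the writing job has finished"
read_job()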

Thanks,

Chris.

Jarom

From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Chris Green
Sent: Friday, July 22, 2016 9:32 AM
To: [email protected]<mailto:[email protected]>
Subject: [Hdf-forum] Parallel dataset resizing strategies


Hi,

I am relatively new to HDF5 and HDF5/parallel, and although I have experience 
with MPI it is not extensive. We are exploring ways of saving data in parallel 
using HDF5 in a field in which it is practically unknown up to now.

Our paradigm is "parallel modular event processing:"

  *   A typical job processes many "events."
  *   An event contains all of the interesting data (raw and processed) 
associated with some time interval.
  *   Each event can be processed independently of all other events.
  *   Each event's data can be subdivided into internal components, "data 
products."
  *   "Modules" are processing subunits which read or generate one or more data 
products for each event.
  *   One can calculate a data dependency graph specifying the allowed ordering 
and/or parallelism of modules processing one or more events simultaneously for 
a given job configuration and event structure.

We have been using h5py with HDF5 and OpenMPI to explore different strategies 
for parallel I/O in a future parallel event-processing framework. One of the 
approaches we have come up with so far is to have one HDF5 dataset per unique 
data product / writer module combination, keeping track of the different 
relevant sections of each dataset via (for now) an external database. This 
works well in serial tests, but in parallel tests we are running up against the 
constraint that dataset resizing is a collective operation, meaning that all 
ranks including non-writers will have to become aware of and duplicate dataset 
resizing operations required by other writers. The problem seems to get even 
worse if there's a possibility that two or more instances of a module would 
need to extend and write to the same dataset at the same time (while processing 
different events, say), since they will have to coordinate and agree on the new 
size of the dataset and their respective sections thereof.
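
To make the constraint concrete, this is roughly the pattern we are forced into with a shared file (sketch only; assumes parallel h5py plus mpi4py, names hypothetical):

# Sketch only: with the mpio driver, H5Dset_extent is collective, so every
# rank must issue the identical resize() even if it never writes the data.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

with h5py.File('events.h5', 'w', driver='mpio', comm=comm) as f:
    # Dataset creation with the mpio driver is itself collective.
    dset = f.create_dataset('products', shape=(0,), maxshape=(None,), dtype='f8')

    # Suppose only rank 0 produces data for this dataset: the new extent
    # still has to be agreed on and applied by every rank.
    new_size = comm.bcast(dset.shape[0] + 100 if rank == 0 else None, root=0)
    dset.resize((new_size,))   # collective: all ranks, writers or not
    if rank == 0:
        dset[:] = 1.0          # independent write by the single writer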

Are we misunderstanding the problem, or is it really this hard? Has anyone else 
hit upon a reasonable strategy for handling this or something like it?

Any pointers appreciated.

Thanks,

Chris Green.

--
Chris Green <[email protected]>, FNAL CS/SCD/ADSS/SSI/TAC;
'phone (630) 840-2167; Skype: chris.h.green;
IM: [email protected], chissgreen (AIM), chris.h.green (Google Talk).