I am not sure I understand your question or the problem(s) you are hoping to
solve. You mention the “N:N case”, so I am assuming you are talking about N
processes writing to N files. You don’t need parallel HDF5 to do that. You can
use serial HDF5, because each stream is a wholly independent file.
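For concreteness, here is a minimal sketch of that serial N:N pattern, where each process writes its own independently named file. The function name, dataset name, and the `task_id` parameter are all hypothetical; this assumes each task has some unique integer identifier it can embed in the filename.

```c
/* Sketch: each OS process writes its own file with *serial* HDF5.
   `task_id` is a hypothetical per-process identifier (e.g. an index
   you assign when spawning the task). No MPI, no parallel HDF5. */
#include <hdf5.h>
#include <stdio.h>

int write_my_stream(int task_id, const double *data, hsize_t n)
{
    char fname[64];
    snprintf(fname, sizeof fname, "stream_%04d.h5", task_id);

    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) return -1;

    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    herr_t st = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                         H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return st < 0 ? -1 : 0;
}
```

Because no two tasks ever touch the same file, there is nothing to coordinate.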
The only situation in which you *need* parallel HDF5 is when you want multiple
MPI processes (parts of a distributed parallel executable) to write to the
*same* file concurrently. Then, their work on the file has to be coordinated
(e.g. creation of HDF5 objects) and their I/O to read/write data from/to
objects in the file can be done either collectively (coordinated) or
independently. But it sounds like MPI parallelism is not really what you are
looking for, and that is especially true if you only want the N:N case.
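To make the distinction concrete, the one thing that genuinely requires parallel HDF5 is opening the *same* file from several MPI ranks via the MPI-IO file driver. A minimal sketch (the function name is mine; `H5Pset_fapl_mpio` is the real HDF5 call):

```c
/* Sketch of what *requires* parallel HDF5: several MPI ranks opening
   the *same* file. File open/create is collective, so every rank in
   `comm` must make this call. */
#include <hdf5.h>
#include <mpi.h>

hid_t open_shared_file(MPI_Comm comm, const char *fname)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* MPI-IO driver */

    /* Collective across `comm`. */
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}
```

Object creation (groups, datasets) in that file is likewise collective; only the actual raw-data reads/writes can be made independent.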
Now, if what you *really* want is multiple different OS processes (or maybe
threads within a single process) to be able to write concurrently to a single
file, then there are not really many options for you *without* taking some
degree of responsibility for coordinating those processes *yourself*. HDF5 will
not do much to help you here. The support in HDF5 for calling it from multiple
threads within the same executable is pretty limited. The locking is very
coarse grained and in all likelihood winds up serializing the threads. And,
there is nothing HDF5 itself (nor any other I/O library for that matter) can do
if you want multiple different OS processes to write to the same file without
doing a lot of the *work* yourself to coordinate them.
Finally, your note gives me the impression that maybe what you are looking for
is one (or more) processes whose number grows (and maybe shrinks) over the life
of the application, where each process needs to write some data. If that
is your ultimate goal, I think there are various ways you could try to
implement that both with and without MPI and parallel HDF5. For example, if you
went with MPI and had a loose upper bound on the total number of processes you
needed, then you could mpiexec that number but then idle/sleep all those that
don’t need to be running at a particular time. You could dynamically create MPI
communicators that represent the current number of tasks you need, and you could
open and use a *single* HDF5 file on that communicator, shared among those
processes. Then, if you need to change the number of tasks, you would close
the HDF5 file, close the communicator and create a new communicator on a
different number of tasks and re-open the file with that new communicator.
There is a lot involved there but I think it could be made to work. But, that
is *only* if you want a single file that is routinely being written to by
differing numbers of tasks. If you really just want N files from N tasks and N
varies with time, then why not just use the OS to spawn I/O tasks, with each
task opening a uniquely named HDF5 file, named perhaps by an internal task id
or something?
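The "resize by re-opening" idea above could be sketched roughly like this. This is only one possible shape: the function name and the `active` flag (which each rank is assumed to compute for itself) are hypothetical, while `MPI_Comm_split` and `H5Pset_fapl_mpio` are the real MPI/HDF5 calls.

```c
/* Sketch: split off a communicator for the currently active ranks,
   open the shared file on it, and when the task count changes, close
   both and call this again. `active` is a hypothetical flag each rank
   computes for itself. */
#include <hdf5.h>
#include <mpi.h>

hid_t reopen_on_active_ranks(int active, const char *fname, MPI_Comm *sub)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inactive ranks pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
    MPI_Comm_split(MPI_COMM_WORLD, active ? 1 : MPI_UNDEFINED, rank, sub);
    if (*sub == MPI_COMM_NULL)
        return -1;              /* this rank idles until the next resize */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, *sub, MPI_INFO_NULL);
    hid_t file = H5Fopen(fname, H5F_ACC_RDWR, fapl);  /* collective */
    H5Pclose(fapl);
    return file;
}

/* Later, before the next resize, the active ranks would do:
   H5Fclose(file); MPI_Comm_free(&sub); and then call this again. */
```

The expensive part is the collective close/re-open at every resize, which is why this only makes sense if resizes are infrequent relative to the writing.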
Also, it might be worth having a look at this HDF5 blog post….
Not sure any of this is helpful but I thought I would mention some ideas.
"Hdf-forum on behalf of Stefano Salvadè" wrote:
Good morning everyone,
I’ve recently started using parallel HDF5 for my company, as we wish to save
analysed data to multiple files at a time. It would be an N:N case, with an
output stream for each file.
The main program itself is written in C#, but we already have an API that
allows us to make calls to HDF5 and MPI in C and C++. It retrieves data from an
external device, executes some analysis and then saves the data, and
parallelizing these three parts would speed up the process. However, I’m not
quite sure how to implement such parallelization for the third part:
So far I’ve seen that parallelization is usually set up right off the bat: the
program is started with mpiexec (I’m on Windows) with a specified number of
processes (like “mpiexec -n x Program.exe”). Unfortunately, running multiple
instances of the whole program in parallel would be problematic, but I’ve seen
that one should be able to spawn processes later during runtime with
MPI_Comm_spawn(), indicating an executable as a target (provided that the
“main” process, the program itself, has been started with
“mpiexec -n 1 Program.exe”).
This second method could do it for us, but I was wondering if there is a more
elegant way to achieve parallel output writing, like calling a function from my
own program instead of an executable.
Bonus question, just to make sure I’ve got the basics of PHDF5 right in the
first place: do I need to have a process for each action that I want to
perform in parallel, be it writing N streams to N files, or writing N streams
to a single file?
Thank you in advance