I am not sure I understand your question or the problem(s) you are hoping to
solve. You mention the “N:N case”, so I am assuming you are talking about N
processes writing to N files. You don’t need parallel HDF5 to do that. You can
use serial HDF5, because each stream is a wholly independent file.
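For concreteness, here is a minimal sketch of that serial N:N pattern, where each process writes its own independently named file. The function name, dataset name, and the `task_id` parameter are all hypothetical; this assumes each task has some unique integer identifier it can embed in the filename.

```c
/* Sketch: each OS process writes its own file with *serial* HDF5.
   `task_id` is a hypothetical per-process identifier (e.g. an index
   you assign when spawning the task). No MPI, no parallel HDF5. */
#include <hdf5.h>
#include <stdio.h>

int write_my_stream(int task_id, const double *data, hsize_t n)
{
    char fname[64];
    snprintf(fname, sizeof fname, "stream_%04d.h5", task_id);

    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) return -1;

    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    herr_t st = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                         H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return st < 0 ? -1 : 0;
}
```

Because no two tasks ever touch the same file, there is nothing to coordinate.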
The only situation in which you *need* parallel HDF5 is when you want multiple
MPI processes (parts of a distributed parallel executable) to write to the
*same* file concurrently. Then, their work on the file has to be coordinated
(e.g. creation of HDF5 objects) and their I/O to read/write data from/to
objects in the file can be done either collectively (coordinated) or
independently. But it sounds like MPI parallelism is not really what you are
looking for, and that is especially true if you only want the N:N case.
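To make the distinction concrete, the one thing that genuinely requires parallel HDF5 is opening the *same* file from several MPI ranks via the MPI-IO file driver. A minimal sketch (the function name is mine; `H5Pset_fapl_mpio` is the real HDF5 call):

```c
/* Sketch of what *requires* parallel HDF5: several MPI ranks opening
   the *same* file. File open/create is collective, so every rank in
   `comm` must make this call. */
#include <hdf5.h>
#include <mpi.h>

hid_t open_shared_file(MPI_Comm comm, const char *fname)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* MPI-IO driver */

    /* Collective across `comm`. */
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}
```

Object creation (groups, datasets) in that file is likewise collective; only the actual raw-data reads/writes can be made independent.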
Now, if what you *really* want is multiple different OS processes (or maybe
threads within a single process) to be able to write concurrently to a single
file, then there are not really many options for you *without* taking some
degree of responsibility for coordinating those processes *yourself*. HDF5 will
not do much to help you here. The support in HDF5 for calling it from multiple
threads within the same executable is pretty limited. The locking is very
coarse grained and in all likelihood winds up serializing the threads. And,
there is nothing HDF5 itself (nor any other I/O library for that matter) can do
if you want multiple different OS processes to write to the same file without
doing a lot of the *work* yourself to coordinate them.
Finally, your note gives me the impression that maybe what you are looking for
is one (or more) processes whose number grows (and maybe shrinks) over the life
of the application, where each process needs to write some data. If that
is your ultimate goal, I think there are various ways you could try to
implement that both with and without MPI and parallel HDF5. For example, if you
went with MPI and had a loose upper bound on the total number of processes you
needed, then you could mpiexec that number but then idle/sleep all those that
don’t need to be running at a particular time. You could dynamically create MPI
communicators that represent the current number of tasks you need, and you could
open and use a *single* HDF5 file on that communicator, shared among those
processes. Then, if you need to change the number of tasks, you would close
the HDF5 file, close the communicator and create a new communicator on a
different number of tasks and re-open the file with that new communicator.
There is a lot involved there but I think it could be made to work. But, that
is *only* if you want a single file that is routinely being written to by
differing numbers of tasks. If you really just want N files from N tasks and N
varies with time, then why not just use the OS to spawn I/O tasks, with each
task opening a uniquely named HDF5 file, named perhaps by an internal task id
or something?
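The "resize by re-opening" idea above could be sketched roughly like this. This is only one possible shape: the function name and the `active` flag (which each rank is assumed to compute for itself) are hypothetical, while `MPI_Comm_split` and `H5Pset_fapl_mpio` are the real MPI/HDF5 calls.

```c
/* Sketch: split off a communicator for the currently active ranks,
   open the shared file on it, and when the task count changes, close
   both and call this again. `active` is a hypothetical flag each rank
   computes for itself. */
#include <hdf5.h>
#include <mpi.h>

hid_t reopen_on_active_ranks(int active, const char *fname, MPI_Comm *sub)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inactive ranks pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
    MPI_Comm_split(MPI_COMM_WORLD, active ? 1 : MPI_UNDEFINED, rank, sub);
    if (*sub == MPI_COMM_NULL)
        return -1;              /* this rank idles until the next resize */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, *sub, MPI_INFO_NULL);
    hid_t file = H5Fopen(fname, H5F_ACC_RDWR, fapl);  /* collective */
    H5Pclose(fapl);
    return file;
}

/* Later, before the next resize, the active ranks would do:
   H5Fclose(file); MPI_Comm_free(&sub); and then call this again. */
```

The expensive part is the collective close/re-open at every resize, which is why this only makes sense if resizes are infrequent relative to the writing.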
Also, it might be worth having a look at this HDF5 blog post….
Not sure any of this is helpful but I thought I would mention some ideas.
"Hdf-forum on behalf of Stefano Salvadè" wrote:
Good morning everyone,
I’ve recently started using parallel HDF5 for my company, as we wish to save
analysed data to multiple files at a time. It would be an N:N case, with an
output stream for each file.
The main program itself is written in C#, but we already have an API that
allows us to make calls to HDF5 and MPI in C and C++. It retrieves data from an
external device, executes some analysis and then saves the data, and
parallelizing these three parts would speed up the process. However, I’m not
quite sure how to implement such parallelization for the third part:
So far I’ve seen that parallelization is usually set up right off the bat: the
program is started with mpiexec (I’m on Windows) with a specified number of
processes (like “mpiexec -n x Program.exe”). Unfortunately, running multiple
instances of the whole program in parallel would be problematic, but I’ve seen
that one should be able to spawn processes later during runtime with
MPI_Comm_spawn(), indicating an executable as a target (provided that the
“main” process, the program itself, has been started with
“mpiexec -n 1 Program.exe”).
This second method could do it for us, but I was wondering if there is a more
elegant way to achieve parallel output writing, like calling a function from my
own program instead of an executable.
Bonus question, just to make sure I’ve got the basics of PHDF5 right in the
first place: do I need to have a process for each action that I want to
perform in parallel, be it writing N streams to N files, or writing N streams
to a single file?
Thank you in advance