The requirements for PHDF5 are here:
<https://support.hdfgroup.org/HDF5/PHDF5/>. It may be a good idea to check
whether you actually get a speed-up from a parallel-FS + PHDF5 setup.
In my interpretation, PHDF5 pays off when you have a fully parallel system
backed by a parallel file system capable of handling parallel IO: large
supercomputing (batch) environments are such systems.
On the other end of the spectrum you can have a single computer with a
single drive and multiple cores; AWS EC2 instances without a local HDD are
similar. In the latter case, using PHDF5 pulls you into extra code and some
restrictions (no filters, ...), because at some choke point there must
be a mechanism to serialise all the READ/WRITE operations.
If you have the latter setup, a separate writer process and a reliable
software fabric (i.e. ZeroMQ + Protocol Buffers, or a similar queue) may
serve you better.
There is also another approach: write into separate files, local to each
process, and then either:
1) copy all files into one single HDF5 container, or
2) use a separate HDF5 file with external links to stitch the files into a
single logical container.
The copy/collect version works on batch processors if your 'collector'
script is scheduled after the MPI job.
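Option 2) can be sketched with h5py (assuming h5py is available to your
collector script; the file and dataset names are made up for illustration):
each rank writes its own plain serial file, and a small master file links
the pieces together afterwards.

```python
import h5py
import numpy as np

# Each "rank" writes its own local file -- plain serial HDF5, no MPI.
for rank in range(3):
    with h5py.File(f"part_{rank}.h5", "w") as f:
        f["data"] = np.arange(5) + rank * 10

# A separate master file links the pieces into one logical container.
with h5py.File("master.h5", "w") as m:
    for rank in range(3):
        m[f"rank{rank}"] = h5py.ExternalLink(f"part_{rank}.h5", "/data")

# Reading through the master file transparently follows the links.
with h5py.File("master.h5", "r") as m:
    value = int(m["rank2"][0])
```

Nothing is copied: the master file stays tiny, and the per-rank files must
simply remain where the links point.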
Of course, if you do have a true parallel environment, you should indeed
benefit from parallel IO.
On Mon, Feb 19, 2018 at 3:10 AM, Stefano Salvadè <stefano.salv...@3brain.com> wrote:
> Good morning everyone,
> I’ve recently started using parallel HDF5 for my company, as we wish to
> save analysed data on multiple files at a time. It would be an N:N case,
> with an output stream for each file.
> The main program itself is written in C#, but we already have an API that
> allows us to make calls to hdf5 and MPI in C and C++. It retrieves data
> from an external device, executes some analysis and then saves the data,
> and parallelizing these three parts would speed up the process. However i’m
> not quite sure how to implement such parallelization on the third bit:
> So far i’ve seen that parallelization is usually implemented right off the
> bat: the program is started with mpiexec (i’m on Windows), with a specified
> number of processes (like “mpiexec -n x Program.exe”). Unfortunately
> running multiple instances of the whole program in parallel would be
> problematic, but i’ve seen that one should be able to spawn processes later
> during runtime with MPI_Spawn(), indicating an executable as a target
> (provided that the “main” process, the program itself, has been started
> with “mpiexec -n 1 Program.exe” for example).
> This second method could do it for us, but I was wondering if there is a
> more elegant way to achieve parallel output writing, like calling a
> function from my own program instead of an executable.
> Bonus question, just to make sure i’ve got the basics of PHDF5 right in
> the first place: I do need to have a process for each parallel action that
> I want to perform in parallel, be it writing N streams to N files, or
> writing N streams to a single file?
> Thank you in advance
> Hdf-forum is for HDF software users discussion.
> Twitter: https://twitter.com/hdf5