Hi Mohamad,

What I'm trying to achieve is graceful error handling in case 'something' 
goes wrong. My parallel HDF5 writer application is the back end of a system 
whose front end reads data off a detector. The front-end system then passes 
the data on, along with some metadata describing where each 2D frame sits 
in the full dataset. So each MPI process runs on a separate server, and each 
server is connected to one piece of readout electronics (front end) which reads 
a 2D strip off the full detector/camera.

Because errors just happen for various reasons in complex systems - especially 
ones in development (and in this case we are talking about several software and 
hardware subsystems working in supposedly beautiful synchronisation) - the 
writer must be able to recover even from erroneous use. For example, the 
front end might send the writer a bit of data with wrongly configured 
offsets, causing us to write outside the dataset bounds (wrong offsets are just 
an example - of course I could add sanity checks for this particular case...)
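To illustrate the kind of sanity check I mean (a minimal sketch - the function name and parameters are my own invention, not taken from our writer): before issuing the H5Dwrite, each node could verify that the hyperslab described by the incoming metadata actually fits inside the current dataset extent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: check that a hyperslab selection
 * (offset + count per dimension) lies inside the current
 * dataset dimensions. Returns true if the write is safe. */
static bool hyperslab_in_bounds(const unsigned long long *offset,
                                const unsigned long long *count,
                                const unsigned long long *dims,
                                size_t rank)
{
    for (size_t i = 0; i < rank; ++i) {
        /* offset + count must not exceed the extent in any dimension;
         * written this way to avoid unsigned overflow in offset + count */
        if (offset[i] > dims[i] || count[i] > dims[i] - offset[i])
            return false;
    }
    return true;
}
```

In the writer this would run just before the (collective) H5Dwrite; a failing check would raise a local error flag instead of issuing the write.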

The bottom line (or question) is really just: is there a way to recover 
somewhat gracefully if an error has occurred on one or more nodes - and what 
would the consequences be? (A corrupt file?)

Perhaps I need to switch off collective mode? Would that allow me to close 
the file without having issued an equal number of extend/write calls on each 
node?

Cheers,
Ulrik

From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Mohamad Chaarawi
Sent: 10 July 2013 17:46
To: 'HDF Users Discussion List'
Subject: Re: [Hdf-forum] Bail out on parallel hdf5 write or extend error

Hi Ulrik,


From: Hdf-forum [mailto:[email protected]] On Behalf Of 
[email protected]
Sent: Wednesday, July 10, 2013 11:01 AM
To: [email protected]
Subject: [Hdf-forum] Bail out on parallel hdf5 write or extend error

Hello,

I am writing an application to stream data from multiple imaging detectors, 
operating in a synchronised fashion, into a single dataset in one HDF5 file. 
The dataset is a chunked 3D dataset where 2D (X,Y) images get appended along 
the 3rd dimension (dimensions are defined as [Z, Y, X]), so as the 2D frames 
are received I extend dimension Z.
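To make the append step concrete (a sketch; the helper name and frame sizes are illustrative, not from the actual writer): for frame number z the writer computes the new extent and the hyperslab selection that the real H5Dset_extent, H5Sselect_hyperslab and H5Dwrite calls would then consume.

```c
#include <stddef.h>

/* Hypothetical helper: given the next frame index z and the frame
 * shape (ny, nx), fill in the new dataset extent [z+1, ny, nx] and
 * the hyperslab (offset [z, 0, 0], count [1, ny, nx]) describing
 * where the 2D frame lands in the [Z, Y, X] dataset. */
static void append_frame_geometry(unsigned long long z,
                                  unsigned long long ny,
                                  unsigned long long nx,
                                  unsigned long long new_dims[3],
                                  unsigned long long offset[3],
                                  unsigned long long count[3])
{
    new_dims[0] = z + 1;  /* grow the unlimited Z dimension */
    new_dims[1] = ny;
    new_dims[2] = nx;
    offset[0] = z;  offset[1] = 0;  offset[2] = 0;
    count[0] = 1;   count[1] = ny;  count[2] = nx;
}
```

With collective I/O, every rank would extend to the same new_dims before each rank writes its own hyperslab.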

At this point I need to work out how to deal with errors - if one node for some 
reason does something wrong like trying to write outside the dataset dimensions 
or whatever, I need to be able to close the current file and return to the 
initial state, ready to create and write to another file.

Hmm, I'm not quite sure I understand what you are trying to achieve here. When 
you say that a node does something wrong, like trying to write outside the 
dataset dimensions, that implies an erroneous program, which should be 
corrected rather than recovered from. From what I understand, you are 
attempting to use HDF5 erroneously, and to continue doing so while expecting a 
certain behaviour. This is not possible.
I might have misunderstood, because I'm not aware of the full details of 
your use case here, i.e. why you would write outside the dataset dimensions.

For performance reasons I am using collective I/O, so I think I need the same 
number of extend, write, etc. calls on each node - or the H5Fclose call will 
hang. Is this correct?

The hang would most probably not happen in H5Fclose if you don't call extend 
and write collectively (when collective I/O is set). It will happen in the 
extend or write itself, because a collective operation expects all processes to 
participate at some point in time. If one process does not call the operation 
while the other processes attempt to talk to it, then your program will hang.

Can I safely use H5Pset_fclose_degree to set H5F_CLOSE_STRONG in the 
parallel HDF5 case, or will that cause a crash/hang/corrupt file?

I do not think this is relevant to what you are asking for. Sure, you can 
set the close degree to H5F_CLOSE_STRONG, but that does not mean you can avoid 
calling collective operations on all processes, or use the API in an erroneous 
manner.
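For reference, the close degree is set on the file access property list. This fragment shows the real HDF5 calls in isolation (the surrounding fapl setup is a sketch with error checking omitted, not taken from Ulrik's code, and it assumes a parallel HDF5 build):

```c
#include <hdf5.h>
#include <mpi.h>

/* On the file access property list used to create/open the file: */
hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* MPI-IO driver */
H5Pset_fclose_degree(fapl, H5F_CLOSE_STRONG);  /* close all open objects
                                                  when the file closes */
/* ... hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl); ... */
H5Pclose(fapl);
```

Note that H5Fclose is itself collective in the parallel case, so all ranks still have to reach it regardless of the close degree.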

Thanks,
Mohamad

Cheers,
Ulrik


---------------------------------------------------------------------
Ulrik Kofoed Pedersen
Senior Software Engineer
Diamond Light Source Ltd
Phone: 01235 77 8580





--

This e-mail and any attachments may contain confidential, copyright and or 
privileged material, and are for the use of the intended addressee only. If you 
are not the intended addressee or an authorised recipient of the addressee 
please notify us of receipt by returning the e-mail and do not use, copy, 
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not 
necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments 
are free from viruses and we cannot accept liability for any damage which you 
may sustain as a result of software viruses which may be transmitted in or with 
the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and 
Wales with its registered office at Diamond House, Harwell Science and 
Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom





_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
