Hi Ulrik,

On 7/10/2013 12:08 PM, [email protected] wrote:

Hi Mohamad,

What I'm trying to achieve is graceful error handling in case 'something' wrong happens. My parallel HDF5 writer application is the back end of a system which at the front reads data off a detector. The front-end system then passes the data on, along with some metadata describing where the 2D frame sits in the full dataset. Each MPI process sits on a separate server, and each server is connected to one piece of readout electronics (front end) which reads a 2D strip off the full detector/camera.

Because errors just happen for various reasons in complex systems -- especially ones in development (and in this case we are talking about several software and hardware subsystems working in supposedly beautiful synchronisation) -- the writer must be able to recover even from erroneous use -- for example, if the front end sends the writer a bit of data with wrongly configured offsets, causing us to write outside the dataset (wrong offsets are just an example -- of course I could add sanity checks for this particular case...)


I still think you are not making a distinction between programming errors and system/hardware errors. But if you want to recover from the case where one process fails in a collective write call, you will need to add fault tolerance yourself, by checking the return status on every process and communicating it to all the ranks in the communicator.

The bottom line (or question) is really just: is there a way to recover somewhat gracefully if an error has happened on one or more nodes -- and what would be the consequences? (corrupt file?)


I do not know how to answer this question. There is a wide range of errors; some can be recovered from, others, well, not so much. Add to that that there is still no fault tolerance in MPI, so recovering from MPI failures is hard. As for file corruption, again it depends on the error. If a failure prevents processes from closing the file and flushing the metadata cache, then yes, you may well end up with a corrupt file.

Perhaps I need to switch off the collective mode? Would that allow me to close the file without having done an equal number of extend/write on each node?


Switching off collective mode means that dataset access operations (H5Dread/H5Dwrite) can be done independently. Other operations are still required to be collective. See:
http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

Thanks,
Mohamad


Cheers,

Ulrik

*From:*Hdf-forum [mailto:[email protected]] *On Behalf Of *Mohamad Chaarawi
*Sent:* 10 July 2013 17:46
*To:* 'HDF Users Discussion List'
*Subject:* Re: [Hdf-forum] Bail out on parallel hdf5 write or extend error

Hi Ulrik,

*From:*Hdf-forum [mailto:[email protected]] *On Behalf Of *[email protected]
*Sent:* Wednesday, July 10, 2013 11:01 AM
*To:* [email protected] <mailto:[email protected]>
*Subject:* [Hdf-forum] Bail out on parallel hdf5 write or extend error

Hello,

I am writing an application to stream data from multiple imaging detectors, operating in a synchronised fashion, into a single dataset in one HDF5 file. The dataset is a chunked 3D dataset where 2D (X,Y) images get appended along the 3rd dimension (dimensions are defined as [Z, Y, X]) -- so as the 2D frames are received, I extend dimension Z.

At this point I need to work out how to deal with errors -- if one node for some reason does something wrong like trying to write outside the dataset dimensions or whatever, I need to be able to close the current file and return to the initial state, ready to create and write to another file.

Hmm, I'm not quite sure I understand what you are trying to achieve here. When you say that a node does something wrong, like trying to write outside the dataset dimensions, that implies an erroneous program, which should be corrected rather than recovered from. From what I understand, you are attempting to use HDF5 erroneously, and to continue doing so while expecting a certain behaviour. This is not possible.

I might have misunderstood you because I'm not aware of the full details about your use case here, i.e. why would you write outside the dataset dimensions.

For performance reasons I am using collective I/O, and so I think I need the same number of extend, write, etc. calls on every process -- or the H5Fclose call will hang. Is this correct?

The hang would most probably not happen in H5Fclose if you don't call extend and write collectively (assuming you have set collective I/O). It will happen in the extend or write itself, because a collective operation expects all processes to be there at some point in time. If one process does not call the operation and the other processes attempt to talk to that process, your program will hang.

Can I use the H5Pset_fclose_degree to set H5F_CLOSE_STRONG safely in the parallel hdf5 case or will that cause a crash/hang/corrupt file?

I do not think this is relevant to what you are asking/require. Sure, you can set H5F_CLOSE_STRONG, but that does not mean you can avoid calling collective operations on all processes, or use the API in an erroneous manner.

Thanks,

Mohamad

Cheers,

Ulrik

---------------------------------------------------------------------

Ulrik Kofoed Pedersen

Senior Software Engineer

Diamond Light Source Ltd

Phone: 01235 77 8580

--

This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom






_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

