> On Nov 26, 2014, at 3:35 PM, Håkon Strandenes <[email protected]> wrote:
>
> Is it really PETSc's task to warn about this? PETSc should trust HDF5 to
> "just work" and HDF5 should actually print sensible warnings/error messages.
> Shouldn't it?
Yes, but if we produce a nice error message it makes everyone's lives easier,
including ours, because we don't have to constantly answer emails about the
same problem discovered by a new person over and over again. Hence we do this
kind of thing a lot.

   Barry

> I'll think about that system command until tomorrow...
>
>> Thanks
>>
>>    Barry
>>
>> A big FAT error message is always better than a FAQ when possible.
>
> Of course.
>
> Håkon
>
>>
>>> I have found that setting this to at least 32 will make my examples run
>>> perfectly on up to 256 processes. No error messages whatsoever, and in my
>>> simple load-and-write-dataset roundtrip h5diff compares the two datasets
>>> and finds them identical. I also notice that Leibniz Rechenzentrum
>>> recommends setting this variable to 100 (or some other suitably large
>>> value) when using NetCDF together with MPT
>>> (https://www.lrz.de/services/software/io/netcdf/).
>>>
>>> This bug has been a pain in the (***)... Perhaps it is worth a FAQ entry?
>>>
>>> Thanks for your time and effort.
>>>
>>> Regards,
>>> Håkon Strandenes
>>>
>>>
>>> On 26. nov. 2014 08:01, Håkon Strandenes wrote:
>>>>
>>>>
>>>> On 25. nov. 2014 22:40, Matthew Knepley wrote:
>>>>> On Tue, Nov 25, 2014 at 2:34 PM, Håkon Strandenes <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> (...)
>>>>>
>>>>> First, this is great debugging.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Second, my reading of the HDF5 document you linked to says that either
>>>>> selection should be valid:
>>>>>
>>>>> "For non-regular hyperslab selection, parallel HDF5 uses independent
>>>>> IO internally for this option."
>>>>>
>>>>> so it ought to fall back to the INDEPENDENT model if it can't do
>>>>> collective calls correctly. However, it appears that the collective
>>>>> call has bugs.
>>>>>
>>>>> My conclusion: Since you have determined that changing the setting to
>>>>> INDEPENDENT produces correct input/output in all the test cases, and
>>>>> since my understanding of the HDF5 documentation is that we should
>>>>> always be able to use COLLECTIVE as an option, this is an HDF5 or MPT
>>>>> bug.
>>>>
>>>> I have conducted yet another test:
>>>> My example (ex10) that I previously posted to the mailing list was set
>>>> up with 250 grid points along each axis. When the topic of chunking was
>>>> brought to the table, I realized that 250 is not evenly divisible by
>>>> four. The example failed on 64 processes, that is, four processes along
>>>> each direction (the division is 62 + 62 + 63 + 63 = 250).
>>>>
>>>> So I have recompiled "my ex10" with 256 grid points in each direction.
>>>> It turns out that this does indeed run successfully on 64 processes.
>>>> Great! It also runs on 128 processes, that is, an 8x4x4 decomposition.
>>>> However, it does not run on 125 processes, that is, a 5x5x5
>>>> decomposition.
>>>>
>>>> The same pattern is clear if I run my example with 250^3 grid points.
>>>> It does not run on process counts like 64 and 128, but does run
>>>> successfully on 125 processes, again only when the sub-domains are of
>>>> exactly equal size (in this case the domain is divided as 5x5x5).
>>>>
>>>> However, I still believe that there are bugs. I did my "roundtrip" by
>>>> loading a dataset and immediately writing the same dataset to a
>>>> different file, this time a 250^3 dataset on 125 processes. It did not
>>>> "pass" this test, i.e. the written dataset was just garbage. I have not
>>>> yet identified whether the garbling is introduced in the reading or
>>>> writing of the dataset.
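[The "big FAT error message" Barry alludes to could look roughly like the
sketch below. This is only a minimal sketch in plain C, not PETSc's actual
code: the threshold of 32 comes from Håkon's tests quoted above, and
detecting that the program really is running under SGI MPT is assumed to
happen elsewhere.]

    #include <stdio.h>
    #include <stdlib.h>

    /* Refuse to proceed quietly if MPI_TYPE_DEPTH is unset or too small
       for collective HDF5 I/O under SGI MPT.  The threshold of 32 is
       taken from the tests reported in this thread. */
    static int check_mpi_type_depth(void)
    {
      const char *s     = getenv("MPI_TYPE_DEPTH");
      int         depth = s ? atoi(s) : 0;   /* treat "unset" as too small */

      if (depth < 32) {
        fprintf(stderr,
                "ERROR: collective HDF5 I/O with SGI MPT appears to require\n"
                "       MPI_TYPE_DEPTH >= 32 (currently %s). Try e.g.\n"
                "       'export MPI_TYPE_DEPTH=100' before launching.\n",
                s ? s : "unset");
        return -1;
      }
      return 0;
    }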
>>>>
>>>>>
>>>>> Does anyone else see the HDF5 documentation differently? Also, it
>>>>> really looks to me like HDF5 messed up the MPI data type in the
>>>>> COLLECTIVE picture below, since it appears to be sliced incorrectly.
>>>>>
>>>>> Possible Remedies:
>>>>>
>>>>> 1) We can allow you to turn off H5Pset_dxpl_mpio()
>>>>>
>>>>> 2) Send this test case to the MPI/IO people at ANL
>>>>>
>>>>> If you think 1) is what you want, we can do it. If you can package
>>>>> this work for 2), it would be really valuable.
>>>>
>>>> I will be fine editing gr2.c manually each time this file is changed
>>>> (I use the sources from Git). But *if* this is not a bug in MPT, but a
>>>> bug in PETSc or HDF5, it should be fixed... Because it is the kind of
>>>> bug that is extremely annoying and a real pain to track down.
>>>>
>>>> Perhaps the HDF5 mailing list could contribute to this issue?
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> Best regards,
>>>>> Håkon Strandenes
>>>>>
>>>>
>>>> Again, thanks for your time.
>>>>
>>>> Regards,
>>>> Håkon
>>>>
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>>    -- Norbert Wiener
>>
>>
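[For readers who want the workaround discussed above without editing gr2.c
by hand: the setting in question is the HDF5 data-transfer property list
passed to H5Dwrite(). Below is a minimal sketch of the two variants; the
dataset/dataspace handles and the buffer are illustrative placeholders, not
PETSc's actual variables.]

    #include <hdf5.h>

    /* Minimal sketch: write a dataset with either collective or
       independent MPI-IO transfers.  Collective is what gr2.c selects
       via H5Pset_dxpl_mpio(); independent is the workaround that made
       the tests in this thread pass. */
    void write_with_xfer_mode(hid_t dset, hid_t memspace, hid_t filespace,
                              const double *buf, int use_collective)
    {
      hid_t plist = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(plist, use_collective ? H5FD_MPIO_COLLECTIVE
                                             : H5FD_MPIO_INDEPENDENT);
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, plist, buf);
      H5Pclose(plist);
    }

[Håkon's roundtrip check then amounts to reading a dataset, writing it back
to a second file with the chosen transfer mode, and comparing the two files
with h5diff.]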
