> On Nov 26, 2014, at 3:35 PM, Håkon Strandenes <[email protected]> wrote:
> 
>> 
> Is it really PETSc's task to warn about this? PETSc should trust HDF5 to 
> "just work" and HDF5 should actually print sensible warnings/error messages. 
> Shouldn't it?

   Yes, but if we produce a nice error message it makes everyone's lives 
easier, including ours because we don't have to constantly answer emails about 
the same problem discovered by a new person over and over again. Hence we do 
this kind of thing a lot.

  Barry

> 
> I'll think about that system command until tomorrow...
> 
>>   Thanks
>> 
>>    Barry
>> 
>> A big FAT error message is always better than a FAQ when possible.
> Of course.
> 
> Håkon
> 
>> 
>> 
>>> 
>>> I have found that setting this to at least 32 makes my examples run 
>>> perfectly on up to 256 processes. No error messages whatsoever, and in my 
>>> simple load-and-write dataset roundtrip, h5diff compares the two datasets 
>>> and finds them identical. I also notice that Leibniz Rechenzentrum 
>>> recommends setting this variable to 100 (or some other suitably large 
>>> value) when using NetCDF together with MPT 
>>> (https://www.lrz.de/services/software/io/netcdf/).
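
[A minimal sketch of the workaround described above, assuming the variable in
question is SGI MPT's MPI_TYPE_DEPTH (the one the linked LRZ page recommends
raising); the launch line and executable name are hypothetical:]

```shell
# Raise MPT's datatype-depth limit before launching the job so every
# MPI rank inherits it (32 sufficed here; LRZ suggests 100 for NetCDF).
export MPI_TYPE_DEPTH=32
echo "MPI_TYPE_DEPTH=$MPI_TYPE_DEPTH"
```

[Then launch as usual, e.g. `mpiexec -n 256 ./ex10`.]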
>>> 
>>> This bug has been a pain in the (***)... Perhaps it is worth a FAQ entry?
>>> 
>>> Thanks for your time and effort.
>>> 
>>> Regards,
>>> Håkon Strandenes
>>> 
>>> 
>>> On 26. nov. 2014 08:01, Håkon Strandenes wrote:
>>>> 
>>>> 
>>>> On 25. nov. 2014 22:40, Matthew Knepley wrote:
>>>>> On Tue, Nov 25, 2014 at 2:34 PM, Håkon Strandenes <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> (...)
>>>>> 
>>>>> First, this is great debugging.
>>>> 
>>>> Thanks.
>>>> 
>>>>> 
>>>>> Second, my reading of the HDF5 document you linked to says that either
>>>>> selection should be valid:
>>>>> 
>>>>>   "For non-regular hyperslab selection, parallel HDF5 uses independent
>>>>> IO internally for this option."
>>>>> 
>>>>> so it ought to fall back to the INDEPENDENT model if it can't do
>>>>> collective calls correctly. However,
>>>>> it appears that the collective call has bugs.
>>>>> 
>>>>> My conclusion: Since you have determined that changing the setting to
>>>>> INDEPENDENT produces
>>>>> correct input/output in all the test cases, and since my understanding
>>>>> of the HDF5 documentation is
>>>>> that we should always be able to use COLLECTIVE as an option, this is an
>>>>> HDF5 or MPT bug.
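
[The two transfer modes under discussion are selected through HDF5's
dataset-transfer property list. A sketch of that call, for reference; in
PETSc this is done internally (in gr2.c), not in user code, and the helper
name here is hypothetical:]

```c
#include <hdf5.h>

/* Return a dataset-transfer property list with the requested MPI-IO
 * mode; pass it as the dxpl argument of H5Dread/H5Dwrite. COLLECTIVE
 * is what PETSc sets by default; INDEPENDENT is the workaround that
 * produced correct results in the tests above. */
static hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;  /* caller must H5Pclose() it after the transfer */
}
```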
>>>> 
>>>> I have conducted yet another test:
>>>> My example (ex10) that I previously posted to the mailing list was set
>>>> up with 250 grid points along each axis. When the topic of chunking was
>>>> brought up, I realized that 250 is not evenly divisible by four. The
>>>> example failed on 64 processes, that is, four processes along each
>>>> direction (the division is 62 + 62 + 63 + 63 = 250).
>>>> 
>>>> So I have recompiled "my ex10" with 256 grid points in each direction. It
>>>> turns out that this does indeed run successfully on 64 processes. Great!
>>>> It also runs on 128 processes, that is, an 8x4x4 decomposition. However,
>>>> it does not run on 125 processes, that is, a 5x5x5 decomposition.
>>>> 
>>>> The same pattern is clear if I run my example with 250^3 grid points. It
>>>> does not run on numbers like 64 and 128, but does run successfully on
>>>> 125 processes, again only when the sub-domains are of exactly equal size
>>>> (in this case the domain is divided as 5x5x5).
>>>> 
>>>> However, I still believe that there are bugs. I did my "roundtrip" by
>>>> loading a dataset and immediately writing the same dataset to a
>>>> different file, this time a 250^3 dataset on 125 processes. It did not
>>>> "pass" this test, i.e. the written dataset was just garbage. I have not
>>>> yet identified whether the garbling is introduced in the reading or the
>>>> writing of the dataset.
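
[The roundtrip check described above can be scripted. A sketch, where the
filenames and the ex10 command-line options are hypothetical; h5diff exits
non-zero and prints the differing elements when the datasets do not match:]

```shell
# in.h5 is the original dataset, out.h5 the copy written back by the
# PETSc program (option names hypothetical).
mpiexec -n 125 ./ex10 -infile in.h5 -outfile out.h5
h5diff in.h5 out.h5 && echo "roundtrip OK" || echo "written dataset differs"
```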
>>>> 
>>>>> 
>>>>> Does anyone else read the HDF5 documentation differently? Also, it
>>>>> really looks to me like HDF5 messed up the MPI data type in the
>>>>> COLLECTIVE picture below, since it appears to be sliced incorrectly.
>>>>> 
>>>>> Possible Remedies:
>>>>> 
>>>>>   1) We can allow you to turn off H5Pset_dxpl_mpio()
>>>>> 
>>>>>   2) Send this test case to the MPI/IO people at ANL
>>>>> 
>>>>> If you think 1) is what you want, we can do it. If you can package this
>>>>> work for 2), it would be really valuable.
>>>> 
>>>> I will be fine editing gr2.c manually each time this file is changed (I
>>>> use the sources from Git). But *if* this is not a bug in MPT, but a bug
>>>> in PETSc or HDF5, it should be fixed... Because it is the kind of bug
>>>> that is extremely annoying and a real pain to track down.
>>>> 
>>>> Perhaps the HDF5 mailing list could contribute in this issue?
>>>> 
>>>>> 
>>>>>   Thanks,
>>>>> 
>>>>>     Matt
>>>>> 
>>>>>    Thanks for your time.
>>>>> 
>>>>>    Best regards,
>>>>>    Håkon Strandenes
>>>>> 
>>>>> 
>>>> 
>>>> Again thanks for your time.
>>>> 
>>>> Regards,
>>>> Håkon
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>> 
>> 
