Hi,
Since we currently cannot create issues on GitLab (the problem is reported at
https://forum.gitlab.com/t/creating-new-issue-gives-cannot-create-issue-getting-whoops-something-went-wrong-on-our-end/41966?u=bsmith),
I am posting my issue here so I don't forget it.
I think the current pattern

  err = WaitForCUDA();CHKERRCUDA(err);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);

should be changed so that PetscLogGpuTimeEnd() itself calls WaitForCUDA(),
or rather a device-agnostic WaitForDevice().
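A minimal C sketch of the proposed change, assuming a simulated device: WaitForDevice() here only drains a counter of pending kernels, and LogGpuTimeBegin()/LogGpuTimeEnd() are hypothetical stand-ins for the PETSc logging routines, not their actual implementation. The point it illustrates is that the end routine synchronizes before reading the clock, so no call site can forget to.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-ins for PETSc's logging state and device sync.
   A real WaitForDevice() would wrap e.g. cudaDeviceSynchronize(). */
static int     pending_kernels = 0;   /* simulated work queued on the device */
static int     sync_calls      = 0;   /* number of device synchronizations */
static clock_t t_begin         = 0;   /* timestamp taken at begin */
static double  gpu_time        = 0.0; /* accumulated time in seconds */

/* Simulated synchronization: block until all queued work is done. */
static int WaitForDevice(void)
{
  pending_kernels = 0;
  sync_calls++;
  return 0;
}

static int LogGpuTimeBegin(void)
{
  t_begin = clock(); /* processor-time clock, stand-in for the real timer */
  return 0;
}

/* Proposed behavior: synchronize inside the end routine, so callers
   no longer have to pair WaitForCUDA() with the end call themselves. */
static int LogGpuTimeEnd(void)
{
  int err = WaitForDevice();
  if (err) return err;
  assert(pending_kernels == 0); /* device is idle before reading the clock */
  gpu_time += (double)(clock() - t_begin) / CLOCKS_PER_SEC;
  return 0;
}
```

A caller then only brackets the kernel launch with the begin/end pair; the synchronization lives in one place instead of at every call site.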
Currently the WaitForCUDA() call is missing in a few places, which results
in incorrect timings. Also, some _SeqCUDA() routines are missing the
PetscLogGpuTimeEnd() call and need to be fixed.
The current model, where every call site must get this pairing right, is a
maintenance nightmare.
Does anyone see a problem with making this change?
I'm fine with this change, as the maintenance benefits outweigh the
performance cost of the extra synchronization for typical use cases.
I propose to also call WaitForDevice() in PetscLogGpuTimeBegin().
This ensures that GPU kernels launched earlier do not spill over into
the timed section.
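A toy C model of the spill-over effect, assuming a virtual clock instead of real GPU timing; LaunchKernel(), WaitForDevice() and TimeKernel() are hypothetical names, not PETSc API. Launching a kernel only queues work, and time advances when the host synchronizes, so without a synchronization at the begin call, earlier untimed work is charged to the timed section.

```c
#include <assert.h>

/* Toy model with a virtual clock; all names are hypothetical, not PETSc API.
   Launching a kernel only queues work; the clock advances when the host
   synchronizes and waits for the queue to drain. */
static double clock_now = 0.0; /* virtual wall clock, arbitrary units */
static double queued    = 0.0; /* device work launched but not finished */

static void LaunchKernel(double cost)
{
  assert(cost >= 0.0); /* kernel cost must be non-negative */
  queued += cost;
}

static void WaitForDevice(void)
{
  clock_now += queued; /* host blocks until all queued work completes */
  queued = 0.0;
}

/* Time one kernel, optionally synchronizing at the begin as proposed. */
static double TimeKernel(double cost, int sync_at_begin)
{
  double t0;
  if (sync_at_begin) WaitForDevice(); /* drain earlier, untimed work */
  t0 = clock_now;                     /* corresponds to the begin call */
  LaunchKernel(cost);
  WaitForDevice();                    /* sync inside the end call */
  return clock_now - t0;
}
```

With an earlier, untimed kernel of cost 5 still queued, timing a kernel of cost 2 reports 7 without the begin synchronization and 2 with it:

  LaunchKernel(5.0); TimeKernel(2.0, 0);  /* returns 7 */
  LaunchKernel(5.0); TimeKernel(2.0, 1);  /* returns 2 */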
Best regards,
Karli