Hmm, I suppose this means Kokkos should accept a stream like we expect it to? According to this somewhat recent merged PR: https://github.com/kokkos/kokkos/pull/1919 you can now construct a "Kokkos::Cuda" object around a stream and pass it as the first argument to range policies as an execution space instance. Here's what I found on it (the CUDA-specific one is useless):
https://github.com/kokkos/kokkos/wiki/ExecutionSpaceConcept
https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3AExecutionSpaceConcept
https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3ACuda  <—— CUDA-specific

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

> On Jan 11, 2021, at 10:35, Mark Adams <[email protected]> wrote:
>
> Jacob, I'm not sure I understand this response. I could not find you on the
> Kokkos slack channel.
>
> Me: And my colleague in PETSc, Jacob Faibussowitsch, has talked to you about
> Kokkos taking a Cuda, Hip, etc., stream. This is something that would make it
> easier to deal with asynchronous GPU solvers in PETSc. We just wanted to
> check on this.
>
> Trott: Kokkos itself can do it for practically every operation
>
> Maybe you want to talk with him at some point, but we can worry about getting
> Cuda to work for now.
>
> On Sun, Jan 10, 2021 at 2:28 PM Jacob Faibussowitsch <[email protected]> wrote:
> I would like as much as possible to pass the cuda and hip streams to Kokkos,
> since I can directly handle much of the annoyance of wrangling multiple
> streams and stream objects externally. Last I checked on this, Kokkos was
> moving toward allowing association of streams with functions, but admittedly
> this was a while back.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> Cell: (312) 694-3391
>
>> On Jan 10, 2021, at 13:10, Mark Adams <[email protected]> wrote:
>>
>> On Sat, Jan 9, 2021 at 7:37 PM Jacob Faibussowitsch <[email protected]> wrote:
>> It is a single object that holds a pointer to every stream implementation
>> and a toggleable type, so it can be universally passed around.
>> Currently it has a cudaStream and a hipStream, but this is easily extendable
>> to any other stream implementation.
>>
>> Do you have any thoughts on how this would work with Kokkos?
>>
>> Would you want to feed Kokkos your Cuda/Hip, etc., stream, or add a Kokkos
>> backend to your object?
>>
>> Junchao might be the person to ask. I would guess Kokkos View (vector)
>> objects carry a stream because they block on a "deep_copy", which moves data
>> to/from the GPU, and it is blocking.
>>
>> Thanks,
>> Mark
>>
>> Best regards,
>>
>> Jacob Faibussowitsch
>> (Jacob Fai - booss - oh - vitch)
>> Cell: +1 (312) 694-3391
>>
>>> On Jan 9, 2021, at 18:19, Mark Adams <[email protected]> wrote:
>>>
>>> Is this stream object going to have Cuda, Kokkos, etc., implementations?
>>>
>>> On Sat, Jan 9, 2021 at 4:09 PM Jacob Faibussowitsch <[email protected]> wrote:
>>> I'm currently working on an implementation of a general PetscStream object.
>>> Currently it only supports Vector ops and has a proof-of-concept KSPCG, but
>>> it should be extensible to other objects when finished. Junchao is also
>>> indirectly working on pipeline support in his NVSHMEM MR. Take a look at
>>> either MR; it would be very useful to get your input, as tailoring either
>>> of these approaches for pipelined algorithms is key.
>>>
>>> Best regards,
>>>
>>> Jacob Faibussowitsch
>>> (Jacob Fai - booss - oh - vitch)
>>> Cell: (312) 694-3391
>>>
>>>> On Jan 9, 2021, at 15:01, Mark Adams <[email protected]> wrote:
>>>>
>>>> I would like to put a non-overlapping ASM solve on the GPU. It's not clear
>>>> that we have a model for this.
>>>>
>>>> PCApply_ASM currently pipelines the scatter with the subdomain solves. I
>>>> think we would want to change this and instead do: 1) a scatter-begin loop,
>>>> 2) a scatter-end and non-blocking solve loop, 3) a solve-wait and
>>>> scatter-begin loop, and 4) a scatter-end loop.
>>>>
>>>> I'm not sure how to go about doing this.
>>>> * Should we make a new PCApply_ASM_PARALLEL, or dump this pipelining
>>>>   algorithm and rewrite PCApply_ASM?
>>>> * Add a solve-wait method to KSP?
>>>>
>>>> Thoughts?
>>>>
>>>> Mark
>>>
>

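For reference, the stream-passing pattern from PR 1919 mentioned at the top of the thread would look roughly like this. This is an untested sketch, assuming a Kokkos build with the CUDA backend recent enough to include that PR:

```cpp
#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // An externally managed stream, e.g. one owned by a PetscStream-like object.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Wrap the stream in an execution space instance.
    Kokkos::Cuda exec(stream);

    const int n = 1 << 20;
    Kokkos::View<double*, Kokkos::CudaSpace> x("x", n);

    // Pass the instance as the first argument to the range policy; the
    // kernel is then enqueued on `stream` rather than the default stream.
    Kokkos::parallel_for(
        Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, n),
        KOKKOS_LAMBDA(const int i) { x(i) = 1.0; });

    exec.fence();  // or cudaStreamSynchronize(stream) on the caller's side
    cudaStreamDestroy(stream);
  }
  Kokkos::finalize();
  return 0;
}
```

Presumably the same pattern carries over per backend (a HIP execution space instance wrapping a hipStream_t), which is what would let a PetscStream-style object hand its current stream straight to Kokkos.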