Mark, even with multiple CPU cores (either with OpenMP or MPI), you can still use the GPU. The current model uses the GPU as an offload device; each MPI rank drives one GPU.
Since Summit has a lot more memory on the GPU (compared to the older GPUs), we can give a larger portion of the work to the GPU. I'll write later about setting up some environment variables.

Sherry

On Tue, Apr 21, 2020 at 6:09 AM Mark Adams <[email protected]> wrote:

>> Odd, but it seems to work fine for me now. E.g., I get a speedup of 6x on
>> a ~50K equation 3D system (Q3 elements with 2 dof per vertex).
>>
>> Mark, is this speedup w.r.t. the CPU version of SUPERLU_DIST? Or just the
>> PETSc factorizations?
>
> I was not clear. This is on Summit, one CPU and one GPU. SUPERLU_DIST-GPU
> vs. PETSc-CPU was about 6x faster. I have compared SUPERLU vs. PETSc on
> the CPU on smaller problems, and PETSc was a little faster.
>
> Note, Summit has 7 cores per GPU, so it would be reasonable to run
> SUPERLU_DIST-CPU on 7 cores, in which case the speedup would clearly be
> gone, but that is not how I run this app.
>
>>> I just updated the master branch with this fix. It will be absorbed in a
>>> future release.
>>>
>>> As for PRNTlevel >= 2, perhaps check your cmake build script. It should
>>> be set to 0 for a production build.
>>
>> I don't see where that gets set. PRNTlevel does not seem to be in our
>> repo. I see it in 'MAKE_INC/make.cuda_gpu: -DDEBUGlevel=0 -DPRNTlevel=1
>> -DPROFlevel=0', but I think it is set at >= 2. I have manually disabled
>> the print statements (~5 places).
>>
>> Thanks,
>> Mark
>>
>>> Sherry
>>>
>>> On Sun, Apr 19, 2020 at 6:32 PM Mark Adams <[email protected]> wrote:
>>>
>>>> Also, we have PRNTlevel >= 2 in SuperLU_dist. This is causing a lot of
>>>> output. It's not clear where that is set (it's a #define).
>>>>
>>>> On Sun, Apr 19, 2020 at 9:28 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> Sherry, I found the problem.
>>>>>
>>>>> I added this print statement to dDestroy_LU:
>>>>>
>>>>>     nb = CEILING(nsupers, grid->npcol);
>>>>>     for (i = 0; i < nb; ++i)
>>>>>         if ( Llu->Lrowind_bc_ptr[i] ) {
>>>>>             fprintf(stderr, "dDestroy_LU: GPU free Llu->Lnzval_bc_ptr[%d/%d] = %p, "
>>>>>                     "CPU free Llu->Lrowind_bc_ptr = %p\n",
>>>>>                     i, nb, Llu->Lnzval_bc_ptr[i], Llu->Lrowind_bc_ptr[i]);
>>>>>             SUPERLU_FREE (Llu->Lrowind_bc_ptr[i]);
>>>>> #ifdef GPU_ACC
>>>>>             checkCuda(cudaFreeHost(Llu->Lnzval_bc_ptr[i]));
>>>>> #else
>>>>>             SUPERLU_FREE (Llu->Lnzval_bc_ptr[i]);
>>>>> #endif
>>>>>         }
>>>>>
>>>>> And I see:
>>>>>
>>>>>     1 SNES Function norm 1.245977692562e-04
>>>>>     dDestroy_LU: GPU free Llu->Lnzval_bc_ptr[0/134] = 0x4ff9b000, CPU free Llu->Lrowind_bc_ptr = 0x4ff9a000
>>>>>     ex112d: cudahook.cc:762: CUresult host_free_callback(void*): Assertion `cacheNode != __null' failed.
>>>>>
>>>>> This looks like Lnzval_bc_ptr is on the CPU, so I removed the GPU_ACC
>>>>> stuff and it works now.
>>>>>
>>>>> I see this in the distribution code. Perhaps this is a serial-run bug?
>>>>>
>>>>> On Sun, Apr 19, 2020 at 5:58 PM Xiaoye S. Li <[email protected]> wrote:
>>>>>
>>>>>> Mark,
>>>>>> You should fork a branch of your own to do this.
>>>>>>
>>>>>> Sherry
>>>>>>
>>>>>> On Sun, Apr 19, 2020 at 2:54 PM Stefano Zampini <[email protected]> wrote:
>>>>>>
>>>>>>> First, commit your changes to the superlu_dist branch, then rerun
>>>>>>> configure with
>>>>>>>
>>>>>>>     --download-superlu_dist-commit=HEAD
>>>>>>>
>>>>>>> > On Apr 20, 2020, at 12:50 AM, Mark Adams <[email protected]> wrote:
>>>>>>> >
>>>>>>> > I would like to modify SuperLU_dist, but if I change the source and
>>>>>>> > run configure, it says there is no need to reconfigure, use --force.
>>>>>>> > I use --force and it seems to clobber my changes. Can I tell
>>>>>>> > configure to build but not download SuperLU?
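[Editor's note appended to the thread] The crash Mark traced in dDestroy_LU comes down to an allocate/free pairing rule: memory that was allocated with plain malloc (SUPERLU_MALLOC) must not be handed to the pinned-memory free path (cudaFreeHost), and vice versa. The sketch below is a hypothetical, CUDA-free illustration of that rule; the names `block`, `alloc_block`, and `free_block` are invented for this example and are not SuperLU_DIST APIs.

```c
#include <stdio.h>
#include <stdlib.h>

/* HEAP stands in for SUPERLU_MALLOC/free; PINNED stands in for
 * cudaMallocHost/cudaFreeHost.  Tagging each block with its origin lets
 * us detect the mismatch that tripped the cudahook assertion on Summit:
 * a CPU-allocated Lnzval_bc_ptr[] freed through the GPU_ACC path. */
typedef enum { HEAP, PINNED } alloc_kind;
typedef struct { alloc_kind kind; void *p; } block;

static block alloc_block(alloc_kind kind, size_t n) {
    /* In real code a PINNED block would come from cudaMallocHost. */
    block b = { kind, malloc(n) };
    return b;
}

static int free_block(block b, alloc_kind assumed) {
    if (b.kind != assumed)
        return -1;   /* wrong free path: the cudahook assertion, in effect */
    free(b.p);
    return 0;
}

int main(void) {
    /* Buggy path: buffer allocated on the CPU, freed as pinned memory. */
    block lnzval = alloc_block(HEAP, 1024);
    if (free_block(lnzval, PINNED) != 0)
        printf("mismatch: CPU buffer handed to the pinned-memory free path\n");
    free(lnzval.p);  /* clean up after the rejected free */

    /* Correct path after the fix: free with the matching routine. */
    block again = alloc_block(HEAP, 1024);
    if (free_block(again, HEAP) == 0)
        printf("matched free succeeds\n");
    return 0;
}
```

The same discipline applies symmetrically: if the distribution phase ever does allocate Lnzval_bc_ptr with cudaMallocHost, only then is the cudaFreeHost branch correct.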
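On the PRNTlevel question: it is a compile-time define passed on the compiler command line, not a setting in the source tree, which is why it does not show up in the repo. A hedged sketch of a quiet production build, reusing the flags from the quoted make.cuda_gpu (the exact CMake variable your build honors may differ):

```shell
# Pass the defines through the C flags when configuring SuperLU_DIST;
# PRNTlevel=0 silences the per-factorization prints Mark disabled by hand.
cmake .. \
  -DCMAKE_C_FLAGS="-DPRNTlevel=0 -DDEBUGlevel=0 -DPROFlevel=0"
```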
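Stefano's and Sherry's advice on rebuilding a locally modified SuperLU_DIST can be sketched as the workflow below. The externalpackages path and the commit message are illustrative assumptions; the key point is that `--download-superlu_dist-commit=HEAD` makes PETSc's configure rebuild from your committed state instead of re-downloading and clobbering the edits.

```shell
# Work on your own branch inside the source PETSc downloaded, as Sherry
# suggests (directory name is illustrative of the usual PETSc layout).
cd $PETSC_DIR/$PETSC_ARCH/externalpackages/git.superlu_dist
git checkout -b my-fix
git commit -am "free Lnzval_bc_ptr with the matching allocator"

# Rerun configure pointing at the committed HEAD of that checkout.
cd $PETSC_DIR
./configure --download-superlu_dist --download-superlu_dist-commit=HEAD
```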
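Finally, the "each MPI rank drives one GPU" model from the top of the thread maps naturally onto Summit's node layout (6 GPUs and 42 usable cores per node). A hypothetical launch line, with a placeholder executable name:

```shell
# One resource set per GPU: 1 MPI rank, 7 cores, 1 GPU each, 6 per node.
jsrun -n 6 -a 1 -c 7 -g 1 ./my_superlu_app
```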
