Re: [petsc-users] Communication during MatAssemblyEnd

Jose E. Roman via petsc-users Mon, 01 Jul 2019 02:28:25 -0700

You can try the following:
- Try with a different DS method: -ds_method 1  or  -ds_method 2  (see 
DSSetMethod)
- Run with -ds_parallel synchronized (see DSSetParallel)
If it does not help, send a reproducible code to slepc-maint


Jose


> El 1 jul 2019, a las 11:10, Ale Foggia via petsc-users 
> <[email protected]> escribió:
> 
> Oh, I also got the same error when I switched to the newest version of SLEPc 
> (using OpenBlas), and I don't know where it is coming from.
> Can you tell me which version of SLEPc and PETSc are you using? And, are you 
> using MKL?
> Thanks for trying :)
> 
> El vie., 28 jun. 2019 a las 16:57, Zhang, Junchao (<[email protected]>) 
> escribió:
> Ran with 64 nodes and 32 ranks/node, met  slepc errors and did not know how 
> to proceed :(
> 
> [363]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
> Date: 2019-06-22 21:39:54 +0000
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
> jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> 
> --Junchao Zhang
> 
> 
> On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia <[email protected]> wrote:
> Junchao,
> I'm sorry for the late response.
> 
> El mié., 26 jun. 2019 a las 16:39, Zhang, Junchao (<[email protected]>) 
> escribió:
> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your 
> processes may have been killed by the cgroup out-of-memory handler."
> 
> I mentioned that I used 1024 nodes and 32 processes on each node because the 
> application needs a lot of memory. I think that for a system of size 38, one 
> needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
> with 512 if it's possible.
> 
> I also tried with 128 core with ./main.x 2 ... and got a weird error message  
> "The size of the basis has to be at least equal to the number                 
>                  of MPI processes used."
> 
> The error comes from the fact that you put a system size of only 2 which is 
> too small.
> I can also see the problem in the assembly with system sizes smaller than 38, 
> so you can try with like 30 (for which I also have a log). In that case I run 
> with 64 nodes and 32 processes per node. I think the problem may also fit in 
> 32 nodes.
> 
> --Junchao Zhang
> 
> 
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <[email protected]> wrote:
> Ale,
>   I successfully built your code and submitted a job to the NERSC Cori 
> machine requiring 32768 KNL cores and one and a half hours. It is estimated 
> to run in 3 days. If you also observed the same problem with less cores, what 
> is your input arguments?  Currently, I use what in your log file, ./main.x 38 
> -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
>   The smaller the better. Thanks.
> --Junchao Zhang
> 
> 
> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <[email protected]> wrote:
> Yes, I used KNL nodes. I you can perform the test would be great. Could it be 
> that I'm not using the correct configuration of the KNL nodes? These are the 
> environment variables I set:
> MKL_NUM_THREADS=1
> OMP_NUM_THREADS=1
> KMP_HW_SUBSET=1t
> KMP_AFFINITY=compact
> I_MPI_PIN_DOMAIN=socket
> I_MPI_PIN_PROCESSOR_LIST=0-63
> MKL_DYNAMIC=0
> 
> The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
> compile it and run it. When I run the test I used only 32 processors per 
> node, and I used 1024 nodes in total, and it's for nspins=38.
> Thank you
> 
> El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao (<[email protected]>) 
> escribió:
> Ale,
>   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes 
>  one year ago. He used 32768 processors and called MatAssemblyEnd 118 times 
> and it used only 1.5 seconds in total.  So I guess something was wrong with 
> your test. If you can share your code, I can have a test on our machine to 
> see how it goes.
>  Thanks.
> --Junchao Zhang
> 
> 
> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <[email protected]> wrote:
> MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  
> Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does 
> global synchronization. The high max/min ratio means load imbalance. What I 
> do not understand is MatAssemblyEnd. The ratio is 1.0. It means processors 
> are already synchronized. With 32768 processors, there are 1.2e+06 messages 
> with average length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) 
> ~2MB messages and it takes 54 seconds. Another chance is the reduction at  
> MatAssemblyEnd. I don't know why it needs 8 reductions. In my mind, one is 
> enough. I need to look at the code.
> 
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
> -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total   
>   Avg         %Total    Count   %Total 
>  0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
> 7.768e+06       13.1%  2.530e+02  13.0% 
>  1:    Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  
> 0.000e+00        0.0%  0.000e+00   0.0% 
>  2:  Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  
> 0.000e+00        0.0%  0.000e+00   0.0% 
>  3:   Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  
> 0.000e+00        0.0%  2.000e+00   0.1% 
>  5:         Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  
> 1.942e+06        0.0%  8.000e+00   0.4% 
>  6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
> 7.768e+06       86.8%  1.674e+03  86.1% 
> 
> --- Event Stage 5: Offdiag
> BuildTwoSidedF         1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
> MatAssemblyBegin       1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
> MatAssemblyEnd         1 1.0 5.3762e+01 1.0      0.00e+00 0.0 1.2e+06 1.9e+06 
> 8.0e+00  1  0  0  0  0  33  0100100100     0
> VecSet                 1 1.0 7.5533e-02 9.0      0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> --Junchao Zhang
> 
> 
> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <[email protected]> wrote:
> 
>    The load balance is definitely out of whack. 
> 
> 
> 
> BuildTwoSidedF         1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult              138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
> 0.0e+00  2  4 13 13  0  15 25100100  0 2935476
> MatAssemblyBegin       1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecNorm                2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
> 2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
> VecCopy                6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY                2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
> VecScatterBegin      138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
> 0.0e+00  1  0 13 13  0   4  0100100  0     0
> VecScatterEnd        138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   4  0  0  0  0     0
> VecSetRandom           1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
> balance ratio above 2 which means some processes have more than twice the 
> work of others. Meanwhile the ratio for anything with communication is 
> extremely in balanced, some processes get to the synchronization point well 
> before other processes. 
> 
> The first thing I would do is worry about the load imbalance, what is its 
> cause? is it one process with much less work than others (not great but not 
> terrible) or is it one process with much more work then the others (terrible) 
> or something in between. I think once you get a handle on the load balance 
> the rest may fall into place, otherwise we still have some exploring to do. 
> This is not expected behavior for a good machine with a good network and a 
> well balanced job. After you understand the load balancing you may need to 
> use one of the parallel performance visualization tools to see why the 
> synchronization is out of whack.
> 
>    Good luck
> 
>   Barry
> 
> 
> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <[email protected]> wrote:
> > 
> > I'm sending one with a bit less time.
> > I'm timing the functions also with std::chronos and for the case of 180 
> > seconds the program runs out of memory (and crushes) before the PETSc log 
> > gets to be printed, so I know the time only from my function. Anyway, in 
> > every case, the times between std::chronos and the PETSc log match.
> > 
> > (The large times are in part "4b- Building offdiagonal part" or "Event 
> > Stage 5: Offdiag").
> > 
> > El vie., 21 jun. 2019 a las 16:09, Zhang, Junchao (<[email protected]>) 
> > escribió:
> > 
> > 
> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <[email protected]> wrote:
> > Thanks both of you for your answers,
> > 
> > El jue., 20 jun. 2019 a las 22:20, Smith, Barry F. (<[email protected]>) 
> > escribió:
> > 
> >   Note that this is a one time cost if the nonzero structure of the matrix 
> > stays the same. It will not happen in future MatAssemblies.
> > 
> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> > > <[email protected]> wrote:
> > > 
> > > Those messages were used to build MatMult communication pattern for the 
> > > matrix. They were not part of the matrix entries-passing you imagined, 
> > > but indeed happened in MatAssemblyEnd. If you want to make sure 
> > > processors do not set remote entries, you can use 
> > > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate 
> > > an error when an off-proc entry is set.
> > 
> > I started being concerned about this when I saw that the assembly was 
> > taking a few hundreds of seconds in my code, like 180 seconds, which for me 
> > it's a considerable time. Do you think (or maybe you need more information 
> > to answer this) that this time is "reasonable" for communicating the 
> > pattern for the matrix? I already checked that I'm not setting any remote 
> > entries. 
> > It is not reasonable. Could you send log view of that test with 180 seconds 
> > MatAssembly?
> >  
> > Also I see (in my code) that even if there are no messages being passed 
> > during MatAssemblyBegin, it is taking time and the "ratio" is very big.
> > 
> > > 
> > > 
> > > --Junchao Zhang
> > > 
> > > 
> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> > > <[email protected]> wrote:
> > > Hello all!
> > > 
> > > During the conference I showed you a problem happening during 
> > > MatAssemblyEnd in a particular code that I have. Now, I tried the same 
> > > with a simple code (a symmetric problem corresponding to the Laplacian 
> > > operator in 1D, from the SLEPc Hands-On exercises). As I understand (and 
> > > please, correct me if I'm wrong), in this case the elements of the matrix 
> > > are computed locally by each process so there should not be any 
> > > communication during the assembly. However, in the log I get that there 
> > > are messages being passed. Also, the number of messages changes with the 
> > > number of processes used and the size of the matrix. Could you please 
> > > help me understand this?
> > > 
> > > I attach the code I used and the log I get for a small problem.
> > > 
> > > Cheers,
> > > Ale
> > > 
> > 
> > <log.txt>
>

Re: [petsc-users] Communication during MatAssemblyEnd

Reply via email to