Ale,
The job got a chance to run but failed with out-of-memory, "Some of your
processes may have been killed by the cgroup out-of-memory handler."
I also tried 128 cores with ./main.x 2 ... and got a weird error message
"The size of the basis has to be at least equal to the number
of MPI processes used."
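
A possible reading of that message, offered as a guess rather than something confirmed by the LSQuantumED source: if the first argument of ./main.x is the number of spins (as 38 is in the large run) and the sites are spin-1/2, then n sites have at most 2^n basis states, so 2 spins give at most 4 states, fewer than the 128 MPI processes. The Python sketch below checks only this upper bound. The function names are hypothetical, and the real basis may be smaller if the code restricts itself to a symmetry sector.

```python
def max_basis_size(nspins: int) -> int:
    """Upper bound on the basis size, ASSUMING spin-1/2 sites (2 states each)."""
    return 2 ** nspins


def enough_states(nspins: int, nprocs: int) -> bool:
    """Necessary (not sufficient) condition for at least one state per MPI process."""
    return max_basis_size(nspins) >= nprocs


if __name__ == "__main__":
    # (nspins, nprocs) pairs taken from the two runs mentioned in this thread.
    for nspins, nprocs in [(2, 128), (38, 32768)]:
        print(nspins, nprocs, max_basis_size(nspins), enough_states(nspins, nprocs))
```

Under this bound, a run on 128 processes would need at least 7 spins.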
--Junchao Zhang
On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang
<[email protected]> wrote:
Ale,
I successfully built your code and submitted a job to the NERSC Cori machine
requesting 32768 KNL cores and one and a half hours. The scheduler estimates it
will run in 3 days. If you also observed the same problem with fewer cores, what
are your input arguments? Currently, I use what is in your log file:
./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
The smaller the better. Thanks.
--Junchao Zhang
On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia
<[email protected]> wrote:
Yes, I used KNL nodes. If you can perform the test, that would be great. Could
it be that I'm not using the correct configuration of the KNL nodes? These are
the environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0
The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to
compile it and run it. When I ran the test I used only 32 processors per node
and 1024 nodes in total, for nspins=38.
Thank you
On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao
(<[email protected]>) wrote:
Ale,
Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on KNL nodes
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and
it used only 1.5 seconds in total. So I guess something was wrong with your
test. If you can share your code, I can run a test on our machine to see how
it goes.
Thanks.
--Junchao Zhang
On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang
<[email protected]> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time. Look
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global
synchronization. The high max/min ratio means load imbalance. What I do not
understand is MatAssemblyEnd. Its ratio is 1.0, which means the processors are
already synchronized. With 32768 processors, there are 1.2e+06 messages with an
average length of 1.9e+06 bytes, so each processor sends about 36
(1.2e+06/32768) messages of ~2MB each, and it takes 54 seconds. Another
possibility is the reductions at MatAssemblyEnd. I don't know why it needs 8
reductions; in my mind, one is enough. I need to look at the code.
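
The per-process figures quoted above can be rechecked with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the rounded values from the MatAssemblyEnd row of the log, and it assumes the messages are spread evenly over the processes and that the whole time is spent moving data, neither of which the log confirms.

```python
# Rounded values read from the MatAssemblyEnd row of the stage 5 log.
NPROCS = 32768          # MPI processes in the run
MESSAGES = 1.2e6        # total message count
AVG_LEN_BYTES = 1.9e6   # average message length in bytes
ELAPSED_S = 5.3762e+01  # max time spent in MatAssemblyEnd, in seconds

# ASSUMPTION: messages are distributed evenly across processes.
msgs_per_proc = MESSAGES / NPROCS
bytes_per_proc = msgs_per_proc * AVG_LEN_BYTES

# ASSUMPTION: the entire elapsed time is data transfer (an upper bound on cost).
bytes_per_second = bytes_per_proc / ELAPSED_S

if __name__ == "__main__":
    print(f"messages per process: {msgs_per_proc:.1f}")
    print(f"data per process:     {bytes_per_proc / 1e6:.1f} MB")
    print(f"implied rate:         {bytes_per_second / 1e6:.2f} MB/s per process")
```

Under these assumptions each process moves roughly 70 MB at an implied rate on the order of 1 MB/s, which is very low for a modern interconnect and is consistent with the suspicion that something other than raw data transfer dominates the 54 seconds.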
Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06       13.1%  2.530e+02  13.0%
 1:    Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.1%
 5:         Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  1.942e+06        0.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06       86.8%  1.674e+03  86.1%
--- Event Stage 5: Offdiag
BuildTwoSidedF         1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0     0
MatAssemblyBegin       1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0     0
MatAssemblyEnd         1 1.0 5.3762e+01      1.0 0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100     0
VecSet                 1 1.0 7.5533e-02      9.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0     0
--Junchao Zhang
On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F.
<[email protected]> wrote:
The load balance is definitely out of whack.
BuildTwoSidedF         1 1.0 1.6722e-02 41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0         0
MatMult              138 1.0 2.6604e+02  7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0   2935476
MatAssemblyBegin       1 1.0 1.6807e-02 36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0         0
MatAssemblyEnd         1 1.0 3.5680e-01  3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0         0
VecNorm                2 1.0 4.4252e+01 74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0   0   0  1     12780
VecCopy                6 1.0 6.5655e-02  2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0         0
VecAXPY                2 1.0 1.3793e-02  2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0  41000838
VecScatterBegin      138 1.0 1.1653e+02 85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0         0
VecScatterEnd        138 1.0 1.3653e+02 22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0   0   0  0         0
VecSetRandom           1 1.0 9.6668e-01  2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0         0
Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel, have a
balance ratio above 2, which means some processes have more than twice the work
of others. Meanwhile the ratio for anything with communication is extremely
imbalanced: some processes get to the synchronization point well before other
processes.
The first thing I would do is worry about the load imbalance. What is its
cause? Is it one process with much less work than the others (not great but not
terrible), one process with much more work than the others (terrible), or
something in between? I think once you get a handle on the load balance the
rest may fall into place; otherwise we still have some exploring to do. This is
not expected behavior for a good machine with a good network and a well
balanced job. After you understand the load balancing you may need to use one
of the parallel performance visualization tools to see why the synchronization
is out of whack.
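
One way to tell these cases apart, sketched in Python under stated assumptions: the per-process times are hypothetical inputs that would have to be gathered separately (the log only reports the max and the max/min ratio), the function names are made up for illustration, and the threshold of 2 is an arbitrary choice. Comparing the max and the min against the median shows whether the imbalance comes from a few overloaded processes, a few underloaded ones, or a general spread.

```python
from statistics import median


def imbalance_summary(times):
    """Return max/min, max/median and median/min for per-process times."""
    lo, hi, mid = min(times), max(times), median(times)
    if lo <= 0:
        raise ValueError("per-process times must be positive")
    return {"ratio": hi / lo, "heavy": hi / mid, "light": mid / lo}


def classify(times, threshold=2.0):
    """Rough label for the kind of imbalance; threshold is an arbitrary cutoff."""
    s = imbalance_summary(times)
    if s["heavy"] >= threshold and s["light"] < threshold:
        return "few overloaded processes"   # the 'terrible' case
    if s["light"] >= threshold and s["heavy"] < threshold:
        return "few underloaded processes"  # 'not great but not terrible'
    if s["ratio"] >= threshold:
        return "spread out"                 # something in between
    return "balanced"


if __name__ == "__main__":
    # Hypothetical timings in seconds, one entry per process.
    print(classify([1.0] * 7 + [4.0]))
    print(classify([4.0] * 7 + [1.0]))
    print(classify([1.0, 2.0, 4.0, 8.0]))
```

The same max/min value can come from any of these situations, which is why the ratio column alone does not identify the cause.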
Good luck
Barry
> On Jun 21, 2019, at 9:27 AM, Ale Foggia
> <[email protected]> wrote:
>
> I'm sending one with a bit less time.
> I'm also timing the functions with std::chrono, and for the 180-second case
> the program runs out of memory (and crashes) before the PETSc log gets
> printed, so I know the time only from my own timer. Anyway, in every case
> the times from std::chrono and the PETSc log match.
>
> (The large times are in part "4b- Building offdiagonal part" or "Event Stage
> 5: Offdiag").
>
> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao
> (<[email protected]>) wrote:
>
>
> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia
> <[email protected]> wrote:
> Thanks both of you for your answers,
>
> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F.
> (<[email protected]>) wrote:
>
> Note that this is a one time cost if the nonzero structure of the matrix
> stays the same. It will not happen in future MatAssemblies.
>
> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users
> > <[email protected]> wrote:
> >
> > Those messages were used to build the MatMult communication pattern for
> > the matrix. They were not the passing of matrix entries that you imagined,
> > but they did happen in MatAssemblyEnd. If you want to make sure processors
> > do not set remote entries, you can use
> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an
> > error when an off-proc entry is set.
>
> I started being concerned about this when I saw that the assembly was taking
> a few hundred seconds in my code, like 180 seconds, which for me is a
> considerable time. Do you think (or maybe you need more information to answer
> this) that this time is "reasonable" for communicating the pattern for the
> matrix? I already checked that I'm not setting any remote entries.
>
> It is not reasonable. Could you send the -log_view output of that test with
> the 180-second MatAssembly?
>
> Also I see (in my code) that even if there are no messages being passed
> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
>
> >
> >
> > --Junchao Zhang
> >
> >
> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users
> > <[email protected]> wrote:
> > Hello all!
> >
> > During the conference I showed you a problem happening during
> > MatAssemblyEnd in a particular code that I have. Now, I tried the same with
> > a simple code (a symmetric problem corresponding to the Laplacian operator
> > in 1D, from the SLEPc Hands-On exercises). As I understand (and please,
> > correct me if I'm wrong), in this case the elements of the matrix are
> > computed locally by each process so there should not be any communication
> > during the assembly. However, in the log I get that there are messages
> > being passed. Also, the number of messages changes with the number of
> > processes used and the size of the matrix. Could you please help me
> > understand this?
> >
> > I attach the code I used and the log I get for a small problem.
> >
> > Cheers,
> > Ale
> >
>
> <log.txt>