I ran some quick tests (with a different example) on a single KNL node and a 
single Haswell node, using 4 processes on each. The MatMult results are below. 
The total running time on KNL is a bit more than twice that on Haswell, so I 
think the results Justin got with SNES ex48 are reasonable, considering that 
KNL cores are much less powerful than Haswell cores, as Richard mentioned.

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
MatMult(KNL)        1609 1.0 1.4044e+02 1.0 6.41e+10 1.0 1.3e+04 3.3e+04 0.0e+00  18  19  91  93   0  18  19  91  93   0  1826
MatMult(Haswell)    1609 1.0 4.4927e+01 1.0 6.41e+10 1.0 1.3e+04 3.3e+04 0.0e+00  18  19  91  93   0  18  19  91  93   0  5708
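
As a quick check on the two MatMult rows above, the slowdown can be read off either the Max time or the Mflop/s column; the two should agree because the flop counts are identical. A minimal Python sketch, using only numbers copied from the log (the variable names are illustrative):

```python
# Values copied from the MatMult rows of the log summary above.
knl = {"time_s": 1.4044e+02, "mflops": 1826}
haswell = {"time_s": 4.4927e+01, "mflops": 5708}

# Same flop count on both machines, so the time ratio and the
# flop-rate ratio should agree up to rounding in the log output.
time_ratio = knl["time_s"] / haswell["time_s"]
rate_ratio = haswell["mflops"] / knl["mflops"]

print(f"MatMult time ratio (KNL / Haswell):    {time_ratio:.2f}")
print(f"MatMult Mflop/s ratio (Haswell / KNL): {rate_ratio:.2f}")
```

Both ratios come out a little above 3, so MatMult alone is slowed down more on KNL than the total run time is.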

Hong (Mr.)

On Apr 4, 2017, at 11:05 AM, Matthew Knepley <knep...@gmail.com> wrote:

On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <jychan...@gmail.com> wrote:
Thanks everyone for the helpful advice. So I tried all the suggestions 
including using libsci. The performance did not improve for my particular runs, 
which I think suggests the problem parameters chosen for my tests (SNES ex48) 
are not optimal for KNL. Does anyone have example test runs I could reproduce 
that compare the performance between KNL and Haswell/Ivybridge/etc?

Let's try to see what is going on with your existing data first.

First, I think the main thing is to make sure we are using MCDRAM. Everything 
else in KNL is window dressing (IMHO). All we have to look at is something like 
MAXPY. You can get the bandwidth estimate from the flop rate and problem size 
(I think), and we can at least get bandwidth ratios between Haswell and KNL 
with that number.
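
One way to turn a flop rate into a bandwidth estimate, sketched in Python under a simple streaming model. The assumption (not stated in the thread) is that a MAXPY of the form y += sum_i a_i x_i over k vectors of doubles reads each x_i once, reads y once, and writes y once, with no cache reuse. The function name and the sample flop rates are hypothetical placeholders, not measurements.

```python
def maxpy_bandwidth_gbs(mflops, k, bytes_per_scalar=8):
    """Estimate sustained memory bandwidth (GB/s) from a MAXPY flop rate.

    Assumed traffic model: y += sum_i a_i * x_i with k vectors moves
    (k + 2) scalars per entry (k reads of x_i, one read and one write of y)
    and does 2*k flops per entry (one multiply and one add per vector).
    """
    bytes_per_flop = bytes_per_scalar * (k + 2) / (2 * k)
    return mflops * 1e6 * bytes_per_flop / 1e9


# Placeholder flop rates in Mflop/s, chosen only to exercise the function.
machine_a_gbs = maxpy_bandwidth_gbs(mflops=4000.0, k=1)
machine_b_gbs = maxpy_bandwidth_gbs(mflops=1000.0, k=1)
print(f"A ~{machine_a_gbs:.0f} GB/s, B ~{machine_b_gbs:.0f} GB/s, "
      f"ratio {machine_a_gbs / machine_b_gbs:.1f}")
```

For a fixed k the bytes-per-flop factor cancels, so the bandwidth ratio between two machines equals the ratio of their MAXPY flop rates even if the traffic model is only approximate.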

   Matt

On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <richardtmi...@gmail.com> wrote:
Yes, one should rely on MKL (or Cray LibSci, if using the Cray toolchain) on 
Cori.  But I'm guessing that this will make no noticeable difference for what 
Justin is doing.

--Richard

On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <kec...@gmail.com> wrote:
How about replacing --download-fblaslapack with vendor specific BLAS/LAPACK?

Murat

On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <richardtmi...@gmail.com> wrote:
On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <hongzh...@anl.gov> wrote:

On Apr 3, 2017, at 1:44 PM, Justin Chang <jychan...@gmail.com> wrote:

Richard,

This is what my job script looks like:

#!/bin/bash
#SBATCH -N 16
#SBATCH -C knl,quad,flat
#SBATCH -p regular
#SBATCH -J knlflat1024
#SBATCH -L SCRATCH
#SBATCH -o knlflat1024.o%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jychan...@gmail.com
#SBATCH -t 00:20:00

#run the application:
cd $SCRATCH/Icesheet
sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N 128 \
  -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1


Maybe it is a typo. It should be numactl -m 1.

"-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the MCDRAM), 
whereas "-m" means to use only NUMA node 1.  In the former case, MCDRAM will be 
used for allocations until the available memory there has been exhausted, and 
then things will spill over into the DRAM.  One would think that "-m" would be 
better for doing performance studies, but on systems where the nodes have swap 
space enabled, you can get terrible performance if your code's working set 
exceeds the size of the MCDRAM, as the system will obediently follow your wishes 
to not use the DRAM and go straight to the swap disk!  I assume the Cori nodes 
don't have swap space, though I could be wrong.


According to the NERSC info pages, they say to add the "numactl" if using flat 
mode. Previously I tried cache mode but the performance seems to be unaffected.

Using cache mode should give similar performance to using flat mode with the 
numactl option. But both approaches should be significantly faster than using 
flat mode without the numactl option. I usually see over 3X speedup. You can 
also run this comparison to check that the high-bandwidth memory is working 
properly.

I also compared 256 Haswell nodes vs 256 KNL nodes and Haswell is roughly 4-5x 
faster. Though I suspect this drastic change has much to do with the initial 
coarse grid size now being extremely small.

I think you may be right about why you see such a big difference.  The KNL 
nodes need enough work to be able to use the SIMD lanes effectively.  Also, if 
your problem gets small enough, then it's going to be able to fit in the 
Haswell's L3 cache.  Although KNL has MCDRAM and this delivers *a lot* more 
memory bandwidth than the DDR4 memory, it will deliver a lot less bandwidth 
than the Haswell's L3.

I'll give the COPTFLAGS a try and see what happens.

Make sure to use --with-memalign=64 for data alignment when configuring PETSc.

Ah, yes, I forgot that.  Thanks for mentioning it, Hong!


The option -xMIC-AVX512 would improve the vectorization performance. But it may 
cause problems for the MPIBAIJ format for some unknown reason. MPIAIJ should 
work fine with this option.

Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us know and 
I'll try to figure this out.

--Richard


Hong (Mr.)

Thanks,
Justin

On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <richardtmi...@gmail.com> wrote:
Hi Justin,

How is the MCDRAM (on-package "high-bandwidth memory") configured for your KNL 
runs?  And if it is in "flat" mode, what are you doing to ensure that you use 
the MCDRAM?  Doing this wrong seems to be one of the most common reasons for 
unexpected poor performance on KNL.

I'm not that familiar with the environment on Cori, but I think that if you are 
building for KNL, you should add "-xMIC-AVX512" to your compiler flags to 
explicitly instruct the compiler to use the AVX512 instruction set.  I usually 
use something along the lines of

  'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'

(The "-g" just adds symbols, which make the output from performance profiling 
tools much more useful.)

That said, if you are comparing 1024 Haswell cores vs. 1024 KNL cores (so 
double the number of Haswell nodes), I'm not surprised that the simulations are 
almost twice as fast using the Haswell nodes.  Keep in mind that an individual 
KNL core is much less powerful than an individual Haswell core.  You are also 
using roughly twice the power footprint (a dual-socket Haswell node should be 
roughly equivalent to a KNL node, I believe).  How do things look when you 
compare equal numbers of nodes?
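
A back-of-the-envelope version of the per-node comparison, as a Python sketch. It takes the node counts from Justin's original message (32 Haswell nodes and 16 KNL nodes for 1024 cores) and assumes, for illustration only, that "almost twice as fast" means exactly a factor of 2; the run times are normalized placeholders, not measurements.

```python
# Node counts from the original message: 1024 cores on each architecture.
haswell_nodes, knl_nodes = 32, 16

# Placeholder run times in arbitrary units, assuming the Haswell run
# is exactly 2x faster (the thread only says "almost twice as fast").
haswell_time, knl_time = 1.0, 2.0

# Node-time cost of each run: nodes occupied multiplied by run time.
haswell_cost = haswell_nodes * haswell_time
knl_cost = knl_nodes * knl_time

print(f"Haswell: {haswell_cost:.0f} node-units, KNL: {knl_cost:.0f} node-units")
print(f"KNL / Haswell cost per solve: {knl_cost / haswell_cost:.2f}")
```

Under that assumption the two runs consume the same amount of node-time, which is why a run on equal node counts is the more informative comparison.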

Cheers,
Richard

On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <jychan...@gmail.com> wrote:
Hi all,

On NERSC's Cori I have the following configure options for PETSc:

./configure --download-fblaslapack --with-cc=cc --with-clib-autodetect=0 
--with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn 
--with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1 
COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt

Where I swapped out the default Intel programming environment with that of Cray 
(e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3'). I want to 
document the performance difference between Cori's Haswell and KNL processors.

When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell and 16 KNL 
nodes), the simulations are almost twice as fast on the Haswell nodes, which 
leads me to suspect that I am not doing something right for KNL. Does anyone 
know what some "optimal" configure options for running PETSc on KNL are?

Thanks,
Justin









--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
