Attached are the job output files (which include -log_view output) for SNES ex48 run on a single Haswell and a single KNL node (32 and 64 cores, respectively). I started with a coarse grid of size 40x40x5 and ran three tests with -da_refine 1/2/3 and -pc_type mg.
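For reference, each test was a single-node run along these lines (a sketch: the srun/numactl flags are carried over from the multi-node job script quoted further down, and the Haswell runs would use -n 32 instead of -n 64):

  srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./ex48cori \
      -M 40 -N 40 -P 5 -thi_mat_type baij -pc_type mg \
      -da_refine 2 -log_view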
What's interesting/strange is that if I try -da_refine 4 on KNL, I get a slurm error that says "slurmstepd: error: Step 4408401.0 exceeded memory limit (96737652 > 94371840), being killed", yet the same run works fine on Haswell. Adding -pc_mg_levels 7 lets KNL get through -da_refine 4, but the performance still does not beat Haswell.

The performance spectrum (dofs/sec) for 1-3 levels of refinement looks like this:

Haswell: 2.416e+03  1.490e+04  5.188e+04
KNL:     9.308e+02  7.257e+03  3.838e+04

which might suggest that KNL does comparatively better as the problem size grows.

On Tue, Apr 4, 2017 at 11:05 AM, Matthew Knepley <knep...@gmail.com> wrote:

> On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <jychan...@gmail.com> wrote:
>
>> Thanks everyone for the helpful advice. I tried all the suggestions,
>> including using LibSci. The performance did not improve for my
>> particular runs, which I think suggests that the problem parameters
>> chosen for my tests (SNES ex48) are not optimal for KNL. Does anyone
>> have example test runs I could reproduce that compare the performance
>> between KNL and Haswell/Ivy Bridge/etc.?
>
> Let's try to see what is going on with your existing data first.
>
> First, I think the main thing is to make sure we are using MCDRAM.
> Everything else on KNL is window dressing (IMHO). All we have to look at
> is something like MAXPY. You can get the bandwidth estimate from the
> flop rate and problem size (I think), and with that number we can at
> least get bandwidth ratios between Haswell and KNL.
>
> Matt
>
>> On Mon, Apr 3, 2017 at 3:06 PM, Richard Mills <richardtmi...@gmail.com>
>> wrote:
>>
>>> Yes, one should rely on MKL (or Cray LibSci, if using the Cray
>>> toolchain) on Cori. But I'm guessing that this will make no noticeable
>>> difference for what Justin is doing.
>>>
>>> --Richard
>>>
>>> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <kec...@gmail.com> wrote:
>>>
>>> How about replacing --download-fblaslapack with a vendor-specific
>>> BLAS/LAPACK?
>>>
>>> Murat
>>>
>>> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <richardtmi...@gmail.com>
>>> wrote:
>>>
>>> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
>>>
>>> On Apr 3, 2017, at 1:44 PM, Justin Chang <jychan...@gmail.com> wrote:
>>>
>>> Richard,
>>>
>>> This is what my job script looks like:
>>>
>>> #!/bin/bash
>>> #SBATCH -N 16
>>> #SBATCH -C knl,quad,flat
>>> #SBATCH -p regular
>>> #SBATCH -J knlflat1024
>>> #SBATCH -L SCRATCH
>>> #SBATCH -o knlflat1024.o%j
>>> #SBATCH --mail-type=ALL
>>> #SBATCH --mail-user=jychan...@gmail.com
>>> #SBATCH -t 00:20:00
>>>
>>> # run the application:
>>> cd $SCRATCH/Icesheet
>>> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori \
>>>   -M 128 -N 128 -P 16 -thi_mat_type baij -pc_type mg \
>>>   -mg_coarse_pc_type gamg -da_refine 1
>>>
>>> Maybe it is a typo. It should be numactl -m 1.
>>>
>>> "-p 1" will also work. "-p" means to "prefer" NUMA node 1 (the MCDRAM),
>>> whereas "-m" means to use only NUMA node 1. In the former case, MCDRAM
>>> will be used for allocations until the available memory there has been
>>> exhausted, and then things will spill over into the DRAM.
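>>>
>>> For example (a sketch, with a placeholder executable):
>>>
>>> numactl -p 1 ./app   # prefer node 1 (MCDRAM); spill to DRAM when full
>>> numactl -m 1 ./app   # bind to node 1 only; never fall back to DRAM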
>>> One would think that "-m" would be better for doing performance
>>> studies, but on systems where the nodes have swap space enabled, you
>>> can get terrible performance if your code's working set exceeds the
>>> size of the MCDRAM, as the system will obediently obey your wish not to
>>> use the DRAM and go straight to the swap disk! I assume the Cori nodes
>>> don't have swap space, though I could be wrong.
>>>
>>> According to the NERSC info pages, one should add the "numactl" call
>>> when using flat mode. I previously tried cache mode, but the
>>> performance seemed unaffected.
>>>
>>> Using cache mode should give performance similar to flat mode with the
>>> numactl option, and both approaches should be significantly faster than
>>> flat mode without the numactl option. I usually see over a 3X speedup.
>>> You can also do such a comparison to check whether the high-bandwidth
>>> memory is working properly.
>>>
>>> I also compared 256 Haswell nodes vs. 256 KNL nodes, and Haswell is
>>> nearly 4-5x faster. Though I suspect this drastic difference has much
>>> to do with the initial coarse grid now being extremely small.
>>>
>>> I think you may be right about why you see such a big difference. The
>>> KNL nodes need enough work to be able to use the SIMD lanes
>>> effectively. Also, if your problem gets small enough, it is going to
>>> fit in the Haswell's L3 cache. Although KNL's MCDRAM delivers *a lot*
>>> more memory bandwidth than the DDR4 memory, it delivers a lot less
>>> bandwidth than the Haswell's L3.
>>>
>>> I'll give the COPTFLAGS a try and see what happens.
>>>
>>> Make sure to use --with-memalign=64 for data alignment when configuring
>>> PETSc.
>>>
>>> Ah, yes, I forgot that. Thanks for mentioning it, Hong!
>>>
>>> The option -xMIC-AVX512 should improve the vectorization performance,
>>> but it may cause problems for the MPIBAIJ format for some unknown
>>> reason. MPIAIJ should work fine with this option.
>>>
>>> Hmm. Try both, and if you see worse performance with MPIBAIJ, let us
>>> know and I'll try to figure it out.
>>>
>>> --Richard
>>>
>>> Hong (Mr.)
>>>
>>> Thanks,
>>> Justin
>>>
>>> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <richardtmi...@gmail.com>
>>> wrote:
>>>
>>> Hi Justin,
>>>
>>> How is the MCDRAM (on-package "high-bandwidth memory") configured for
>>> your KNL runs? And if it is in "flat" mode, what are you doing to
>>> ensure that you use the MCDRAM? Getting this wrong seems to be one of
>>> the most common reasons for unexpectedly poor performance on KNL.
>>>
>>> I'm not that familiar with the environment on Cori, but I think that if
>>> you are building for KNL, you should add "-xMIC-AVX512" to your
>>> compiler flags to explicitly instruct the compiler to use the AVX-512
>>> instruction set. I usually use something along the lines of
>>>
>>> COPTFLAGS='-g -O3 -fp-model fast -xMIC-AVX512'
>>>
>>> (The "-g" just adds symbols, which make the output from performance
>>> profiling tools much more useful.)
>>>
>>> That said, if you are comparing 1024 Haswell cores vs. 1024 KNL cores
>>> (so double the number of Haswell nodes), I'm not surprised that the
>>> simulations are almost twice as fast on the Haswell nodes. Keep in mind
>>> that an individual KNL core is much less powerful than an individual
>>> Haswell core. You are also using roughly twice the power footprint (a
>>> dual-socket Haswell node should be roughly equivalent to a KNL node, I
>>> believe).
>>> How do things look when you compare equal numbers of nodes?
>>>
>>> Cheers,
>>> Richard
>>>
>>> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <jychan...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> On NERSC's Cori I have the following configure options for PETSc:
>>>
>>> ./configure --download-fblaslapack --with-cc=cc \
>>>   --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 \
>>>   --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0 \
>>>   --with-mpiexec=srun --with-64-bit-indices=1 COPTFLAGS=-O3 \
>>>   CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
>>>
>>> where I swapped out the default Intel programming environment for the
>>> Cray one (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3').
>>> I want to document the performance difference between Cori's Haswell
>>> and KNL processors.
>>>
>>> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell or
>>> 16 KNL nodes), the simulations are almost twice as fast on the Haswell
>>> nodes, which leads me to suspect that I am not doing something right
>>> for KNL. Does anyone know some "optimal" configure options for running
>>> PETSc on KNL?
>>>
>>> Thanks,
>>> Justin
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
>   -- Norbert Wiener
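Following Matt's MAXPY suggestion, one back-of-the-envelope way to turn the VecMAXPY rate reported by -log_view into a bandwidth figure (a sketch: it assumes a single x vector and 8-byte reals, and the Mflop/s value below is a placeholder, not a number from the attached logs):

  # y += a*x: 2 flops per entry, ~24 bytes moved (load x, load y, store y),
  # i.e. roughly 12 bytes per flop
  MFLOPS=5000   # placeholder: read the VecMAXPY Mflop/s figure from -log_view
  echo "scale=1; $MFLOPS * 12 / 1000" | bc   # approximate bandwidth in GB/s

Doing this for both logs should at least give the Haswell/KNL bandwidth ratio Matt is after.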
Attachments:
  testhas_flat_1node.o4407087 (binary data)
  testknl_flat_1node.o4407080 (binary data)