Re: [petsc-users] Configuring PETSc for KNL

Richard Mills Mon, 03 Apr 2017 13:06:47 -0700

Yes, one should rely on MKL (or Cray LibSci, if using the Cray toolchain)
on Cori.  But I'm guessing that this will make no noticeable difference for
what Justin is doing.
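
For reference, the configure suggestions collected in this thread could be combined roughly as in the sketch below. It only prints the options and does not run configure. Treat it as an assumption-laden illustration: it takes Justin's original options, drops --download-fblaslapack in favor of MKL via --with-blaslapack-dir, and adds --with-memalign=64 and the COPTFLAGS mentioned further down. The function name knl_configure_args and the /path/to/mkl fallback are made up for the example, and MKLROOT is assumed to be set by the Intel environment. Note that -fp-model and -xMIC-AVX512 are Intel compiler options, so this variant presumes the Intel programming environment rather than PrgEnv-cray; with the Cray wrappers, LibSci is normally linked by the wrappers themselves and the flags would differ.

```shell
#!/bin/bash
# Sketch only: assemble (but do not run) a PETSc configure line for Cori KNL.
# knl_configure_args is a hypothetical helper; MKLROOT is assumed to be
# provided by the Intel environment, and /path/to/mkl is a placeholder.
knl_configure_args() {
  local mkl="${MKLROOT:-/path/to/mkl}"
  # Flags Richard suggests later in the thread (Intel compiler only).
  local copt='-g -O3 -fp-model fast -xMIC-AVX512'
  printf '%s\n' \
    --with-cc=cc --with-cxx=CC --with-fc=ftn \
    --with-clib-autodetect=0 --with-cxxlib-autodetect=0 \
    --with-fortranlib-autodetect=0 \
    --with-debugging=0 --with-mpiexec=srun --with-64-bit-indices=1 \
    --with-memalign=64 \
    "--with-blaslapack-dir=${mkl}" \
    "COPTFLAGS=${copt}" CXXOPTFLAGS=-O3 FOPTFLAGS=-O3
}

# Print one option per line for review before pasting after ./configure.
knl_configure_args
```

The options are printed one per line so they can be checked before use; PETSC_ARCH is left out on purpose and would be chosen per build.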


--Richard

On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <[email protected]> wrote:

> How about replacing --download-fblaslapack with vendor specific
> BLAS/LAPACK?
>
> Murat
>
> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <[email protected]>
> wrote:
>
>> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <[email protected]> wrote:
>>
>>>
>>> On Apr 3, 2017, at 1:44 PM, Justin Chang <[email protected]> wrote:
>>>
>>> Richard,
>>>
>>> This is what my job script looks like:
>>>
>>> #!/bin/bash
>>> #SBATCH -N 16
>>> #SBATCH -C knl,quad,flat
>>> #SBATCH -p regular
>>> #SBATCH -J knlflat1024
>>> #SBATCH -L SCRATCH
>>> #SBATCH -o knlflat1024.o%j
>>> #SBATCH --mail-type=ALL
>>> #SBATCH [email protected]
>>> #SBATCH -t 00:20:00
>>>
>>> #run the application:
>>> cd $SCRATCH/Icesheet
>>> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N
>>> 128 -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine
>>> 1
>>>
>>>
>>> Maybe it is a typo. It should be numactl -m 1.
>>>
>>
>> "-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the MCDRAM),
>> whereas "-m" means to use only NUMA node 1.  In the former case, MCDRAM
>> will be used for allocations until the available memory there has been
>> exhausted, and then things will spill over into the DRAM.  One would think
>> that "-m" would be better for doing performance studies, but on systems
>> where the nodes have swap space enabled, you can get terrible performance
>> if your code's working set exceeds the size of the MCDRAM, as the system
>> will obediently honor your request not to use the DRAM and go straight to
>> the swap disk!  I assume the Cori nodes don't have swap space, though I could
>> be wrong.
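
The difference between the two numactl policies, and the no-numactl baseline Hong suggests comparing against, can be sketched as three launch lines. The helper below only prints the commands; it does not call srun or numactl. The name knl_launch_cmd and the policy labels are invented for this illustration, the rank counts are copied from Justin's script, and the application arguments are omitted. It assumes, as in the discussion above, that MCDRAM shows up as NUMA node 1 in flat mode (running "numactl -H" on a compute node should confirm the node numbering).

```shell
#!/bin/bash
# Sketch only: print the launch line for a given MCDRAM placement policy.
# knl_launch_cmd and the policy names are hypothetical.
#   prefer -> numactl -p 1  (use MCDRAM first, spill over to DRAM)
#   bind   -> numactl -m 1  (allocate only from MCDRAM)
#   none   -> no numactl    (baseline for checking the MCDRAM speedup)
knl_launch_cmd() {
  local policy="$1" wrapper
  case "$policy" in
    prefer) wrapper='numactl -p 1' ;;
    bind)   wrapper='numactl -m 1' ;;
    none)   wrapper='' ;;
    *) echo "unknown policy: $policy" >&2; return 1 ;;
  esac
  # ${wrapper:+...} inserts the wrapper plus a space only when it is non-empty.
  echo "srun -n 1024 -c 4 --cpu_bind=cores ${wrapper:+$wrapper }/tmp/ex48cori"
}

for p in prefer bind none; do
  knl_launch_cmd "$p"
done
```

Timing the same problem under "prefer" and "none" is one way to check whether the high-bandwidth memory is actually being used.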
>>
>>
>>> According to the NERSC info pages, they say to add the "numactl" if
>>> using flat mode. Previously I tried cache mode but the performance seems to
>>> be unaffected.
>>>
>>>
>>> Using cache mode should give similar performance to using flat mode with
>>> the numactl option. But both approaches should be significantly faster than
>>> using flat mode without the numactl option. I usually see over 3X speedup.
>>> You can also do such comparison to see if the high-bandwidth memory is
>>> working properly.
>>>
>>> I also compared 256 Haswell nodes vs 256 KNL nodes and Haswell is
>>> nearly 4-5x faster. Though I suspect this drastic change has much to do
>>> with the initial coarse grid size now being extremely small.
>>>
>> I think you may be right about why you see such a big difference.  The
>> KNL nodes need enough work to be able to use the SIMD lanes effectively.
>> Also, if your problem gets small enough, then it's going to be able to fit
>> in the Haswell's L3 cache.  Although KNL has MCDRAM and this delivers *a
>> lot* more memory bandwidth than the DDR4 memory, it will deliver a lot less
>> bandwidth than the Haswell's L3.
>>
>>> I'll give the COPTFLAGS a try and see what happens
>>>
>>>
>>> Make sure to use --with-memalign=64 for data alignment when configuring
>>> PETSc.
>>>
>>
>> Ah, yes, I forgot that.  Thanks for mentioning it, Hong!
>>
>>
>>> The option -xMIC-AVX512 would improve the vectorization performance. But
>>> it may cause problems for the MPIBAIJ format for some unknown reason.
>>> MPIAIJ should work fine with this option.
>>>
>>
>> Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us
>> know and I'll try to figure this out.
>>
>> --Richard
>>
>>
>>>
>>> Hong (Mr.)
>>>
>>> Thanks,
>>> Justin
>>>
>>> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <[email protected]>
>>> wrote:
>>>
>>>> Hi Justin,
>>>>
>>>> How is the MCDRAM (on-package "high-bandwidth memory") configured for
>>>> your KNL runs?  And if it is in "flat" mode, what are you doing to ensure
>>>> that you use the MCDRAM?  Doing this wrong seems to be one of the most
>>>> common reasons for unexpected poor performance on KNL.
>>>>
>>>> I'm not that familiar with the environment on Cori, but I think that if
>>>> you are building for KNL, you should add "-xMIC-AVX512" to your compiler
>>>> flags to explicitly instruct the compiler to use the AVX512 instruction
>>>> set.  I usually use something along the lines of
>>>>
>>>>   'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
>>>>
>>>> (The "-g" just adds symbols, which make the output from performance
>>>> profiling tools much more useful.)
>>>>
>>>> That said, if you are comparing 1024 Haswell cores vs. 1024 KNL cores
>>>> (so double the number of Haswell nodes), I'm not surprised that the
>>>> simulations are almost twice as fast using the Haswell nodes.  Keep in
>>>> mind that an individual KNL core is much less powerful than an
>>>> individual Haswell core.  You are also using roughly twice the power
>>>> footprint (a dual-socket Haswell node should be roughly equivalent to a
>>>> KNL node, I believe).  How do things look when you compare equal numbers
>>>> of nodes?
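
As a small worked example of the equal-node comparison: Justin's numbers (1024 ranks on 32 Haswell nodes or on 16 KNL nodes) imply 32 ranks per Haswell node and 64 per KNL node. Assuming those per-node counts are kept, the rank counts for the same number of nodes work out as below. The helper name ranks_for_nodes is made up for the sketch, and the script only prints the srun arguments.

```shell
#!/bin/bash
# Sketch only: rank counts for an equal-node Haswell vs. KNL comparison.
# Per-node counts are inferred from the thread: 1024/32 = 32 (Haswell)
# and 1024/16 = 64 (KNL). ranks_for_nodes is a hypothetical helper.
ranks_for_nodes() {
  local nodes="$1" per_node="$2"
  echo $(( nodes * per_node ))
}

nodes=16
echo "haswell: srun -N ${nodes} -n $(ranks_for_nodes "$nodes" 32)"
echo "knl:     srun -N ${nodes} -n $(ranks_for_nodes "$nodes" 64)"
```

With 16 nodes on each side this gives 512 Haswell ranks against 1024 KNL ranks, which is the node-for-node comparison asked about above rather than the core-for-core one.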
>>>>
>>>> Cheers,
>>>> Richard
>>>>
>>>> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> On NERSC's Cori I have the following configure options for PETSc:
>>>>>
>>>>> ./configure --download-fblaslapack --with-cc=cc
>>>>> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0
>>>>> --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0
>>>>> --with-mpiexec=srun --with-64-bit-indices=1 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>>>> FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
>>>>>
>>>>> Where I swapped out the default Intel programming environment with
>>>>> that of Cray (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3').
>>>>> I want to document the performance difference between Cori's Haswell and
>>>>> KNL processors.
>>>>>
>>>>> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell
>>>>> and 16 KNL nodes), the simulations are almost twice as fast on Haswell
>>>>> nodes, which leads me to suspect that I am not doing something right for
>>>>> KNL. Does anyone know of some "optimal" configure options for running
>>>>> PETSc on KNL?
>>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
