Hey,

Here is some data on what you should see with STREAM when comparing against conventional Xeons:
https://www.karlrupp.net/2016/07/knights-landing-vs-knights-corner-haswell-ivy-bridge-and-sandy-bridge-stream-benchmark-results/

Note that MCDRAM only pays off if you can keep enough cores busy. Thus, anything below 16 processes is unlikely to give you any benefit. Also, your working set must be large enough not to stay in L3 on Haswell (I think this was already mentioned earlier in this thread).
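
To make the working-set point concrete, here is a rough back-of-the-envelope sketch. All numbers in it are assumptions for illustration only: the L3 size (two sockets with 40 MiB each on a Haswell node), the bytes of solver data per unknown, and the problem sizes are not taken from Justin's runs.

```python
def working_set_per_node(n_unknowns, n_nodes, bytes_per_unknown):
    """Rough number of bytes of solver data held by each node."""
    return n_unknowns * bytes_per_unknown / n_nodes


def fits_in_cache(working_set_bytes, cache_bytes):
    """True if the per-node working set could stay resident in cache."""
    return working_set_bytes <= cache_bytes


# Assumed values, for illustration only.
L3_BYTES_PER_NODE = 2 * 40 * 2**20  # assumption: 2 sockets x 40 MiB L3
BYTES_PER_UNKNOWN = 200             # hypothetical: matrix rows plus work vectors
N_NODES = 32                        # Haswell node count from the original post

for n_unknowns in (10**6, 10**8):   # hypothetical global problem sizes
    ws = working_set_per_node(n_unknowns, N_NODES, BYTES_PER_UNKNOWN)
    verdict = "may fit in L3" if fits_in_cache(ws, L3_BYTES_PER_NODE) else "exceeds L3"
    print(f"{n_unknowns:>11d} unknowns: {ws / 2**20:8.1f} MiB per node, {verdict}")
```

If the per-node figure lands well below the L3 size, the Haswell run is probably being served from cache, and a comparison against MCDRAM bandwidth would not be meaningful.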

Best regards,
Karli



On 04/04/2017 06:05 PM, Matthew Knepley wrote:
On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <[email protected]
<mailto:[email protected]>> wrote:

    Thanks everyone for the helpful advice. So I tried all the
    suggestions including using libsci. The performance did not improve
    for my particular runs, which I think suggests the problem
    parameters chosen for my tests (SNES ex48) are not optimal for KNL.
    Does anyone have example test runs I could reproduce that compare
    the performance between KNL and Haswell/Ivybridge/etc?


Let's try to see what is going on with your existing data first.

First, I think the main thing is to make sure we are using MCDRAM.
Everything else in KNL
is window dressing (IMHO). All we have to look at is something like
MAXPY. You can get the
bandwidth estimate from the flop rate and problem size (I think), and we
can at least get
bandwidth ratios between Haswell and KNL with that number.
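
A minimal sketch of that estimate, assuming a simple traffic model for MAXPY (y += sum_i a_i * x_i): per vector entry, each x_i is read once, y is read once and written once, and 2 flops are done per x_i. This ignores write-allocate traffic and any cache reuse, so treat the result as a rough figure. The function names and the flop rates below are made up for illustration; the real rates would come from the VecMAXPY line of -log_view.

```python
def maxpy_bytes_per_flop(n_vectors, bytes_per_scalar=8):
    """Bytes moved per flop for y += sum_i a_i * x_i under the simple model."""
    bytes_moved = (n_vectors + 2) * bytes_per_scalar  # read x_1..x_n, read y, write y
    flops = 2 * n_vectors                             # one multiply and one add per x_i
    return bytes_moved / flops


def bandwidth_estimate(flop_rate, n_vectors):
    """Estimated memory bandwidth in bytes/s from a measured flop rate in flop/s."""
    return flop_rate * maxpy_bytes_per_flop(n_vectors)


# Hypothetical flop rates (flop/s), not measured values.
haswell_rate = 1.0e9
knl_rate = 3.0e9
n_vectors = 4  # assumed number of vectors combined per MAXPY call

bw_haswell = bandwidth_estimate(haswell_rate, n_vectors)
bw_knl = bandwidth_estimate(knl_rate, n_vectors)
print(f"Haswell: {bw_haswell / 1e9:.1f} GB/s")
print(f"KNL:     {bw_knl / 1e9:.1f} GB/s")
print(f"KNL/Haswell bandwidth ratio: {bw_knl / bw_haswell:.2f}")
```

With the same number of vectors on both machines, the bandwidth ratio reduces to the ratio of the flop rates, so the ratio is usable even if the absolute traffic model is off.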

   Matt


    On Mon, Apr 3, 2017 at 3:06 PM Richard Mills
    <[email protected] <mailto:[email protected]>> wrote:

        Yes, one should rely on MKL (or Cray LibSci, if using the Cray
        toolchain) on Cori.  But I'm guessing that this will make no
        noticeable difference for what Justin is doing.

        --Richard

        On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <[email protected]
        <mailto:[email protected]>> wrote:

            How about replacing --download-fblaslapack with vendor
            specific BLAS/LAPACK?

            Murat

            On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills
            <[email protected] <mailto:[email protected]>>
            wrote:

                On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong
                <[email protected] <mailto:[email protected]>> wrote:


                    On Apr 3, 2017, at 1:44 PM, Justin Chang
                    <[email protected] <mailto:[email protected]>>
                    wrote:

                    Richard,

                    This is what my job script looks like:

                    #!/bin/bash
                    #SBATCH -N 16
                    #SBATCH -C knl,quad,flat
                    #SBATCH -p regular
                    #SBATCH -J knlflat1024
                    #SBATCH -L SCRATCH
                    #SBATCH -o knlflat1024.o%j
                    #SBATCH --mail-type=ALL
                    #SBATCH [email protected]
                    #SBATCH -t 00:20:00

                    #run the application:
                    cd $SCRATCH/Icesheet
                    sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
                    srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1
                    /tmp/ex48cori -M 128 -N 128 -P 16 -thi_mat_type
                    baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1


                    Maybe it is a typo. It should be numactl -m 1.


                "-p 1" will also work.  "-p" means to "prefer" NUMA node
                1 (the MCDRAM), whereas "-m" means to use only NUMA node
                1.  In the former case, MCDRAM will be used for
                allocations until the available memory there has been
                exhausted, and then things will spill over into the
                DRAM.  One would think that "-m" would be better for
                doing performance studies, but on systems where the
                nodes have swap space enabled, you can get terrible
                performance if your code's working set exceeds the size
                of the MCDRAM, as the system will obediently obey your
                wishes to not use the DRAM and go straight to the swap
                disk!  I assume the Cori nodes don't have swap space,
                though I could be wrong.


                    According to the NERSC info pages, they say to add
                    the "numactl" if using flat mode. Previously I
                    tried cache mode but the performance seems to be
                    unaffected.

                    Using cache mode should give similar performance to
                    using flat mode with the numactl option. But both
                    approaches should be significantly faster than using
                    flat mode without the numactl option. I usually see
                    over 3X speedup. You can also do such a comparison to
                    see if the high-bandwidth memory is working properly.

                    I also compared 256 Haswell nodes vs 256 KNL
                    nodes and Haswell is nearly 4-5x faster. Though I
                    suspect this drastic change has much to do with
                    the initial coarse grid size now being extremely
                    small.

                I think you may be right about why you see such a big
                difference.  The KNL nodes need enough work to be able
                to use the SIMD lanes effectively.  Also, if your
                problem gets small enough, then it's going to be able to
                fit in the Haswell's L3 cache.  Although KNL has MCDRAM
                and this delivers *a lot* more memory bandwidth than the
                DDR4 memory, it will deliver a lot less bandwidth than
                the Haswell's L3.

                    I'll give the COPTFLAGS a try and see what happens

                    Make sure to use --with-memalign=64 for data
                    alignment when configuring PETSc.


                Ah, yes, I forgot that.  Thanks for mentioning it, Hong!


                    The option -xMIC-AVX512 would improve the
                    vectorization performance. But it may cause problems
                    for the MPIBAIJ format for some unknown reason.
                    MPIAIJ should work fine with this option.


                Hmm.  Try both, and, if you see worse performance with
                MPIBAIJ, let us know and I'll try to figure this out.

                --Richard



                    Hong (Mr.)

                    Thanks,
                    Justin

                    On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills
                    <[email protected]
                    <mailto:[email protected]>> wrote:

                        Hi Justin,

                        How is the MCDRAM (on-package "high-bandwidth
                        memory") configured for your KNL runs?  And if
                        it is in "flat" mode, what are you doing to
                        ensure that you use the MCDRAM?  Doing this
                        wrong seems to be one of the most common
                        reasons for unexpected poor performance on KNL.

                        I'm not that familiar with the environment on
                        Cori, but I think that if you are building for
                        KNL, you should add "-xMIC-AVX512" to your
                        compiler flags to explicitly instruct the
                        compiler to use the AVX512 instruction set.  I
                        usually use something along the lines of

                          'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'

                        (The "-g" just adds symbols, which make the
                        output from performance profiling tools much
                        more useful.)

                        That said, I think that if you are comparing
                        1024 Haswell cores vs. 1024 KNL cores (so
                        double the number of Haswell nodes), I'm not
                        surprised that the simulations are almost
                        twice as fast using the Haswell nodes.  Keep
                        in mind that individual KNL cores are much
                        less powerful than individual Haswell
                        cores.  You are also using roughly twice the
                        power footprint (dual socket Haswell node
                        should be roughly equivalent to a KNL node, I
                        believe).  How do things look when you
                        compare equal numbers of nodes?

                        Cheers,
                        Richard

                        On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang
                        <[email protected]
                        <mailto:[email protected]>> wrote:

                            Hi all,

                            On NERSC's Cori I have the following
                            configure options for PETSc:

                            ./configure --download-fblaslapack
                            --with-cc=cc --with-clib-autodetect=0
                            --with-cxx=CC --with-cxxlib-autodetect=0
                            --with-debugging=0 --with-fc=ftn
                            --with-fortranlib-autodetect=0
                            --with-mpiexec=srun
                            --with-64-bit-indices=1 COPTFLAGS=-O3
                            CXXOPTFLAGS=-O3 FOPTFLAGS=-O3
                            PETSC_ARCH=arch-cori-opt

                            Where I swapped out the default Intel
                            programming environment with that of Cray
                            (e.g., 'module switch PrgEnv-intel/6.0.3
                            PrgEnv-cray/6.0.3'). I want to document
                            the performance difference between Cori's
                            Haswell and KNL processors.

                            When I run a PETSc example like SNES ex48
                            on 1024 cores (32 Haswell and 16 KNL
                            nodes), the simulations are almost twice
                            as fast on Haswell nodes. Which leads me
                            to suspect that I am not doing something
                            right for KNL. Does anyone know what are
                            some "optimal" configure options for
                            running PETSc on KNL?

                            Thanks,
                            Justin










--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
