Thank you
Yours sincerely,
TAY wee-beng
On 3/11/2015 12:45 PM, Barry Smith wrote:
On Nov 2, 2015, at 10:37 PM, TAY wee-beng <[email protected]> wrote:
Hi,
I tried:
1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
2. -poisson_pc_type gamg
Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason.

Does your Poisson solve have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix (you shouldn't)?

There may be something wrong with your Poisson discretization that was also messing up hypre.
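A minimal sketch, assuming a C/PETSc code in which A is the assembled Poisson matrix (variable names are illustrative only): for a pure-Neumann Poisson problem the constant vector lies in the null space, and attaching it to the matrix lets the Krylov method and the AMG preconditioner handle the singular operator; the second fragment is one way to check the diagonal for zeros.

    MatNullSpace nullsp;
    MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp); /* constant null space */
    MatSetNullSpace(A, nullsp);
    MatNullSpaceDestroy(&nullsp);

    /* quick check for zeros on the diagonal */
    Vec       d;
    PetscReal dmin;
    MatCreateVecs(A, &d, NULL);
    MatGetDiagonal(A, d);
    VecAbs(d);
    VecMin(d, NULL, &dmin);   /* dmin == 0.0 means some diagonal entry is zero */
    VecDestroy(&d);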
Both options give:
    1   0.00150000   0.00000000   0.00000000   1.00000000   NaN   NaN   NaN
    M Diverged but why?, time = 2
    reason = -9
How can I check what's wrong?
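A side note: in the PETSc releases of this period a converged reason of -9 corresponds to KSP_DIVERGED_NANORINF, i.e. a NaN or Inf appeared during the solve. A minimal sketch of querying it in code, assuming the Poisson KSP object is called ksp:

    KSPConvergedReason reason;
    KSPGetConvergedReason(ksp, &reason);   /* -9 == KSP_DIVERGED_NANORINF */
    if (reason < 0) PetscPrintf(PETSC_COMM_WORLD, "Diverged, reason = %d\n", (int)reason);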
Thank you
Yours sincerely,
TAY wee-beng
On 3/11/2015 3:18 AM, Barry Smith wrote:
hypre is just not scaling well here. I do not know why. Since hypre is a black box for us there is no way to determine why the scaling is poor.

If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about which routines are scaling well or poorly.
Barry
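A minimal sketch of what those two runs might look like, assuming the executable is ./a.out, the Poisson solve uses the poisson_ options prefix as elsewhere in this thread, and the grid size is set inside the code:

    mpiexec -n 8  ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 -log_summary
    mpiexec -n 64 ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 -log_summary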
On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
Hi,
I have attached the 2 files.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 2:55 PM, Barry Smith wrote:
Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results.
Barry
On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
Hi,
I have attached the new results.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 12:27 PM, Barry Smith wrote:
Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.

You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.

Barry

Something makes no sense with the output: it gives

    KSPSolve  199 1.0  2.3298e+03 1.0  5.20e+09 1.8  3.8e+04 9.9e+05 5.0e+02  90 100 66 100 24  90 100 66 100 24   165

90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
Hi,

I have attached the new run with 100 time steps for 48 and 96 cores.

Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
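A minimal sketch of one way to request this in PETSc, assuming the Poisson KSP object is called ksp and the matrix passed to KSPSetOperators really does not change between time steps; the equivalent command-line option with the prefix used here would be -poisson_ksp_reuse_preconditioner:

    KSPSetReusePreconditioner(ksp, PETSC_TRUE);  /* keep the preconditioner built at the first solve */

This only makes sense while the matrix is unchanged; if the matrix were to change, the reused preconditioner would gradually become a poor one.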
Why does the number of processes increase so much? Is there something wrong with my coding? It seems to be so for my new run too.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 9:49 AM, Barry Smith wrote:
If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
Barry
On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
Hi,

Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96 cores log (log96).

Why does the number of processes increase so much? Is there something wrong with my coding?

Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?

Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?

Also, what about the momentum eqn? Is it working well?

I will try the gamg later too.
Thank you
Yours sincerely,
TAY wee-beng
On 2/11/2015 12:30 AM, Barry Smith wrote:
You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.

Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.

    PCSetUp    3 1.0  3.2445e+01 1.0  9.58e+06 2.0  0.0e+00 0.0e+00 4.0e+00  62  8  0  0  4  62  8  0  0  5    11
    PCSetUp    3 1.0  4.3599e+02 1.0  9.58e+06 2.0  0.0e+00 0.0e+00 4.0e+00  85 18  0  0  6  85 18  0  0  6     2

Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it is too large.

You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.

Barry
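A minimal sketch of the one-time setup being described, assuming a C/PETSc code in which the Poisson matrix A is constant and only the right-hand side b changes each step (names are illustrative only):

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOptionsPrefix(ksp, "poisson_");
    KSPSetOperators(ksp, A, A);      /* A is set once, outside the time loop           */
    KSPSetFromOptions(ksp);          /* e.g. -poisson_pc_type gamg or hypre            */
    for (step = 0; step < nsteps; step++) {
      /* update only the right-hand side b here */
      KSPSolve(ksp, b, x);           /* the AMG setup happens once, at the first solve */
    }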
On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
On 1/11/2015 10:00 AM, Barry Smith wrote:

On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:

On 1/11/2015 12:47 AM, Matthew Knepley wrote:

On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
Hi,

I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:

Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16 GB of memory per node)
8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus)
Each cabinet contains 96 computing nodes.
One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.
There are 2 ways to give performance:

1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem.

2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
Cluster specs:

CPU: AMD 6234, 2.4 GHz
8 cores / processor (CPU)
6 CPU / node
So 48 cores / node
Not sure about the memory / node
The parallel efficiency 'En' for a given degree of parallelism 'n' indicates how much the program is efficiently accelerated by parallel processing. 'En' is given by the following formulae. Although their derivation processes are different depending on strong and weak scaling, the derived formulae are the same.
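The formulae themselves did not come through in the plain-text email; as a rough sketch only, assuming the proposal uses the standard Amdahl form with serial fraction \alpha estimated from the measured runs,

    S(n) = \frac{1}{\alpha + (1-\alpha)/n}, \qquad E_n = \frac{S(n)}{n} = \frac{1}{\alpha n + (1-\alpha)}

so for large n the efficiency decays roughly like 1/(\alpha n), which is why an extrapolation to thousands of cores gives such a small number.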
From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?

For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.

Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
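A minimal sketch of the measurement being described, in notation that is mine rather than the proposal's: keep the problem size per process fixed, so the run on N p processes uses a problem N times larger than the baseline run on p processes, and define

    E_{weak}(N) = \frac{T(p)}{T(N\,p)}

where T is the elapsed time over many time steps. Perfect weak scaling gives E_{weak} = 1, and the measured decay of E_{weak} from 48 to 96 (and, if possible, more) cores is what would be extrapolated toward the 2205 x 8 core case.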
OK, I checked the results for my weak scaling, and the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near >90% speedup when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.

However, it's mentioned in the FAQ that due to memory requirements, it's impossible to get >90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speedup when I double the cores and problem size for my current 48/96 cores setup. Is that so?
What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?

Barry
Hi,

I have attached the output:

48 cores: log48
96 cores: log96

There are 2 solvers - the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.

Problem size doubled from 158x266x150 to 158x266x300.
So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?

Thanks.

Thanks,

   Matt
Is it possible for this type of scaling in PETSc (>50%), when using 17640 (2205 x 8) cores?

Btw, I do not have access to the system.

Sent using CloudMagic Email
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to
which their experiments lead.
-- Norbert Wiener