On 3/11/2015 8:52 PM, Matthew Knepley wrote:
On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <[email protected]> wrote:

    Hi,

    I tried and have attached the log.

    Ya, my Poisson eqn has Neumann boundary conditions. Do I need to
    specify some null space stuff, like KSPSetNullSpace or
    MatNullSpaceCreate?


Yes, you need to attach the constant null space to the matrix.

  Thanks,

     Matt
Ok so can you point me to a suitable example so that I know which one to use specifically?

Thanks.
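
(For reference, attaching the constant null space looks roughly like the sketch below, in C. The matrix name A and the error-checking style are assumptions here -- adapt them to your code; the pure-Neumann PETSc tutorials, e.g. src/ksp/ksp/examples/tutorials/ex29.c, do essentially this.

    Mat            A;       /* assumed: your assembled Poisson matrix */
    MatNullSpace   nullsp;
    PetscErrorCode ierr;
    /* PETSC_TRUE says the null space contains the constant vector */
    ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);   /* KSP projects it out during the solve */
    ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);

Older PETSc releases attached the null space to the solver instead, via KSPSetNullSpace(ksp, nullsp).)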


    Thank you

    Yours sincerely,

    TAY wee-beng

    On 3/11/2015 12:45 PM, Barry Smith wrote:

        On Nov 2, 2015, at 10:37 PM, TAY wee-beng <[email protected]> wrote:

            Hi,

            I tried:

            1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg

            2. -poisson_pc_type gamg

        Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason.

        Does your Poisson eqn have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix? (You shouldn't.)

           There may be something wrong with your Poisson discretization that was also messing up hypre.
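
        (A quick way to check the diagonal question, as a sketch -- MatGetDiagonal, VecAbs, and VecMin are standard PETSc calls; the matrix name A is an assumption:

            Vec            diag;
            PetscReal      dmin;
            PetscErrorCode ierr;
            ierr = MatCreateVecs(A, &diag, NULL);CHKERRQ(ierr);
            ierr = MatGetDiagonal(A, diag);CHKERRQ(ierr);
            ierr = VecAbs(diag);CHKERRQ(ierr);                /* look at |d_i| */
            ierr = VecMin(diag, NULL, &dmin);CHKERRQ(ierr);   /* smallest diagonal magnitude */
            ierr = PetscPrintf(PETSC_COMM_WORLD, "min |diag| = %g\n", (double)dmin);CHKERRQ(ierr);
            ierr = VecDestroy(&diag);CHKERRQ(ierr);

        A value of (numerically) zero here would confirm the zero-diagonal suspicion.)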



            Both options give:

                1  0.00150000  0.00000000  0.00000000  1.00000000  NaN  NaN  NaN
                M Diverged but why?, time = 2
                reason = -9

            How can I check what's wrong?

            Thank you

            Yours sincerely,

            TAY wee-beng

            On 3/11/2015 3:18 AM, Barry Smith wrote:

                    hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine the reason for the poor scaling.

                    If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in which routines it is scaling well or poorly.

                   Barry



                On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:

                    Hi,

                    I have attached the 2 files.

                    Thank you

                    Yours sincerely,

                    TAY wee-beng

                    On 2/11/2015 2:55 PM, Barry Smith wrote:

                        Run (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processors and send the two -log_summary results.

                           Barry


                        On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:

                            Hi,

                            I have attached the new results.

                            Thank you

                            Yours sincerely,

                            TAY wee-beng

                            On 2/11/2015 12:27 PM, Barry Smith wrote:

                                Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.

                                You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.

                                Barry

                                Something makes no sense with the output: it gives

KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90 100 66 100 24  90 100 66 100 24   165

                                90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.



                                On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:

                                    Hi,

                                    I have attached the new run with 100 time steps for 48 and 96 cores.

                                    Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?

                                    Why does the number of processes increase so much? Is there something wrong with my coding? It seems to be so for my new run too.
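
                                    (On the reuse question: since only the RHS changes, one option is to tell the KSP to keep the preconditioner once it has been built -- a sketch, assuming the Poisson solver object is named ksp:

                                        ierr = KSPSetReusePreconditioner(ksp, PETSC_TRUE);CHKERRQ(ierr);
                                        /* then call KSPSolve() each step without resetting the operators */

                                    The same thing is available on the command line as -poisson_ksp_reuse_preconditioner, assuming the poisson_ options prefix used in this thread.)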

                                    Thank you

                                    Yours sincerely,

                                    TAY wee-beng

                                    On 2/11/2015 9:49 AM, Barry Smith wrote:

                                            If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.

                                           Barry



                                        On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:

                                        Hi,

                                        Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).

                                        Why does the number of processes increase so much? Is there something wrong with my coding?

                                        Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?

                                        Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?

                                        Also, what about the momentum eqn? Is it working well?

                                        I will try the gamg later too.

                                        Thank you

                                        Yours sincerely,

                                        TAY wee-beng

                                        On 2/11/2015 12:30 AM, Barry Smith wrote:

                                            You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
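
                                            (One way to guard against that -- an illustrative command line, reusing this thread's poisson_ prefix -- is to pin the solver explicitly on both runs instead of relying on defaults, e.g.

                                                -poisson_ksp_type gmres -poisson_pc_type hypre -poisson_pc_hypre_type boomeramg

                                            so the 48- and 96-process runs are guaranteed to use the same methods.)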

                                            Anyways all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds:

PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11

PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2

                                            Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it may be too large.

                                            You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.

                                               Barry



                                            On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:


                                            On 1/11/2015 10:00 AM, Barry Smith wrote:

                                                On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:


                                                On 1/11/2015 12:47 AM, Matthew Knepley wrote:

                                                    On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
                                                        Hi,

                                                        I understand that, as mentioned in the FAQ, due to the limitations in memory, the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:

                                                        Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
                                                        8 cores / processor
                                                        Interconnect: Tofu (6-dimensional mesh/torus)
                                                        Each cabinet contains 96 computing nodes.

                                                        One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.

                                                        There are 2 ways to give performance:

                                                        1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem.

                                                        2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
                                                        I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.

                                                        Cluster specs:
                                                        CPU: AMD 6234 2.4GHz
                                                        8 cores / processor (CPU)
                                                        6 CPU / node
                                                        So 48 cores / node
                                                        Not sure about the memory / node

                                                        The parallel efficiency 'En' for a given degree of parallelism 'n' indicates how much the program is efficiently accelerated by parallel processing. 'En' is given by the following formulae. Although their derivation processes are different depending on strong and weak scaling, the derived formulae are the same.
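
                                                        (The formulae themselves are not quoted in this thread. For reference, the usual textbook definitions -- an assumption about what the proposal intends, with T_1 the time on one process and T_n the time on n processes -- are

                                                            En = T_1 / (n * T_n)   for strong scaling (fixed total problem size)
                                                            En = T_1 / T_n         for weak scaling (fixed problem size per process)

                                                        and Amdahl's law with parallel fraction a gives the estimate T_n = T_1 * ((1 - a) + a/n), hence En = 1 / (n * (1 - a) + a), which is why the projected efficiency falls off so quickly as n grows.)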
                                                        From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
                                                        For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
                                                    The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.

                                                    Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.

                                                Ok, I checked the results for my weak scaling, and the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.

                                                However, it's mentioned in the FAQ that, due to memory requirements, it's impossible to get a > 90% speed up when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get a > 90% speed up for my current 48/96 cores setup. Is that so?

                                                What is the output of -ksp_view -log_summary on the problem, and then on the problem doubled in size and number of processors?

                                                   Barry

                                            Hi,

                                            I have attached the output:

                                            48 cores: log48
                                            96 cores: log96

                                            There are 2 solvers - the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.

                                            Problem size doubled from 158x266x150 to 158x266x300.
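
                                            (For context, that solver combination corresponds to options along these lines -- illustrative only, assuming the momentum_ and poisson_ prefixes used in this thread; the exact options are in the attached logs:

                                                -momentum_ksp_type bcgs
                                                -poisson_pc_type hypre -poisson_pc_hypre_type boomeramg

                                            Note that boomeramg is the default -pc_hypre_type, so the last option may be implicit.)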

                                                So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?

                                                Thanks.

                                                                   Thanks,

                                                                      Matt
                                                        Is it possible for this type of scaling in PETSc (> 50%), when using 17640 (2205 x 8) cores? Btw, I do not have access to the system.






                                                    --
                                                    What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
                                                    -- Norbert Wiener

                                                    <log48.txt><log96.txt>

                                            <log48_10.txt><log48.txt><log96.txt>

                                    <log96_100.txt><log48_100.txt>

                            <log96_100_2.txt><log48_100_2.txt>

                    <log64_100.txt><log8_100.txt>





--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
