Hi Hong,

Thanks for checking this. A mechanical model was added at the time the solver failed, which caused the problem. We need to improve this part of the code.

Thanks again and best wishes,

Danyang

On 15-12-08 08:10 PM, Hong wrote:
Danyang :
Your matrices are ill-conditioned, numerically singular with
Recip. condition number = 6.000846e-16
Recip. condition number = 2.256434e-27
Recip. condition number = 1.256452e-18
i.e., condition numbers = O(1.e16 - 1.e27), so there is no accuracy in the computed solution.

I checked your matrices 168 - 172 and got Recip. condition number = 1.548816e-12.

You need to check your model to understand why the matrices are so ill-conditioned.
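
As a rough illustration of why such values mean "no accuracy", the sketch below converts the reciprocal condition numbers quoted above into condition numbers and applies the common rule of thumb that about log10(cond) of the roughly 16 decimal digits of double precision can be lost. The function name `digits_left` and the 16-digit budget are assumptions made for illustration; this is not something computed by SuperLU or PETSc.

```python
import math

# Reciprocal condition numbers quoted in this message.
rcond_values = [6.000846e-16, 2.256434e-27, 1.256452e-18, 1.548816e-12]

def digits_left(rcond, double_digits=16):
    """Rule-of-thumb estimate of the decimal digits that remain trustworthy."""
    cond = 1.0 / rcond          # condition number from its reciprocal
    lost = math.log10(cond)     # digits that may be lost in the solve
    return max(0.0, double_digits - lost)

for rcond in rcond_values:
    cond = 1.0 / rcond
    print(f"rcond = {rcond:.6e}  cond ~ {cond:.1e}  digits left ~ {digits_left(rcond):.1f}")
```

Under this rule of thumb the first three matrices leave essentially no reliable digits, while a reciprocal condition number near 1.e-12 still leaves a few.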

Hong

    Hi Hong,

    Sorry to bother you again. The modified code works much better
    than before using either superlu or mumps. However, it still
    encounters failures. The case is similar to the previous one,
    with ill-conditioned matrices.

    The code crashed after a long simulation if I use
    superlu_dist, but does not fail if I use superlu. I restarted the
    simulation from just before the time it crashes and can reproduce
    the following error:

    timestep: 22 time: 1.750E+04 years delt: 2.500E+00 years iter: 1 max.sia: 5.053E-03 tol.sia: 1.000E-01
     Newton Iteration Convergence Summary:
     Newton     maximum     maximum                 solver
     iteration  updatePa    updateTemp  residual    iterations  maxvolpa  maxvoltemp  nexvolpa  nexvoltemp
     1          0.1531E+08  0.1755E+04  0.6920E-05  1           5585      4402        5814      5814

    *** Error in `../program_test': malloc(): memory corruption:
    0x0000000003a70d50 ***
    Program received signal SIGABRT: Process abort signal.
    Backtrace for this error:

    The solver failed at timestep 22, Newton iteration 2. I exported
    the matrices at timestep 1 (matrix 1) and timestep 22 (matrices 140
    and 141). Matrix 141 is where it failed. The three matrices here
    are not ill-conditioned according to the estimated values.

    I did the same using the newly modified ex52f code and found quite
    different results for matrix 141. The norm from superlu is much
    more acceptable than that from superlu_dist. In this test, memory
    corruption was not detected. The code and example data can be
    downloaded from the link below.

    https://www.dropbox.com/s/i1ls0bg0vt7gu0v/petsc-superlu-test2.tar.gz?dl=0


    ****************More tests on matrix_and_rhs_bin2*******************
    mpiexec.hydra -n 1 ./ex52f -f0
    ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs
    ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
    -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
    -matrix_index_end 141  -pc_type lu -pc_factor_mat_solver_package
    superlu -ksp_monitor_true_residual -mat_superlu_conditionnumber
     -->loac matrix a
     -->load rhs b
     size l,m,n,mm       90000       90000 90000       90000
      Recip. condition number = 6.000846e-16
      0 KSP preconditioned resid norm 1.146871454377e+08 true resid
    norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
      1 KSP preconditioned resid norm 2.071118508260e-06 true resid
    norm 3.363767171515e-08 ||r(i)||/||b|| 7.140102249181e-12
    Norm of error  3.3638E-08 iterations     1
     -->Test for matrix          140
      Recip. condition number = 2.256434e-27
      0 KSP preconditioned resid norm 2.084372893355e+14 true resid
    norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
      1 KSP preconditioned resid norm 4.689629276419e+00 true resid
    norm 1.037236635337e-01 ||r(i)||/||b|| 2.201690918330e-05
    Norm of error  1.0372E-01 iterations     1
     -->Test for matrix          141
      Recip. condition number = 1.256452e-18
      0 KSP preconditioned resid norm 1.055488964519e+08 true resid
    norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
      1 KSP preconditioned resid norm 2.998827511681e-04 true resid
    norm 4.805214542776e-04 ||r(i)||/||b|| 1.019979130994e-07
    Norm of error  4.8052E-04 iterations     1
     --> End of test, bye


    mpiexec.hydra -n 1 ./ex52f -f0
    ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs
    ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check
    -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140
    -matrix_index_end 141  -pc_type lu -pc_factor_mat_solver_package
    superlu_dist
     -->loac matrix a
     -->load rhs b
     size l,m,n,mm       90000       90000 90000       90000
    Norm of error  3.6752E-08 iterations     1
     -->Test for matrix          140
    Norm of error  1.6335E-01 iterations     1
     -->Test for matrix          141
    Norm of error  3.4345E+01 iterations     1
     --> End of test, bye

    Thanks,

    Danyang

    On 15-12-07 12:01 PM, Hong wrote:
    Danyang:
    Add 'call MatSetFromOptions(A,ierr)' to your code.
    Attached below is ex52f.F modified from your ex52f.F to be
    compatible with petsc-dev.

    Hong

        Hello Hong,

        Thanks for the quick reply. The option
        "-mat_superlu_dist_fact SamePattern" works like a charm when I
        use it from the command line.

        How can I make this option the default? I tried using
        PetscOptionsInsertString("-mat_superlu_dist_fact
        SamePattern",ierr) in my code, but this does not work.

        Thanks,

        Danyang


        On 15-12-07 10:42 AM, Hong wrote:
        Danyang :

        Adding '-mat_superlu_dist_fact SamePattern' fixed the
        problem. Below is how I figured it out.

        1. Reading ex52f.F, I see '-superlu_default' =
        '-pc_factor_mat_solver_package superlu_dist'; the latter
        enables runtime options for other packages. I use
        superlu_dist-4.2 and superlu-4.1 for the tests below.

        2. Using matrix 168 to set up the KSP solver and factorization,
        all packages (petsc, superlu_dist and mumps) give the same
        correct results:

        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package petsc
         -->loac matrix a
         -->load rhs b
         size l,m,n,mm       90000 90000       90000 90000
        Norm of error  7.7308E-11 iterations     1
         -->Test for matrix          168
        ..
         -->Test for matrix          172
        Norm of error  3.8461E-11 iterations     1

        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package superlu_dist
        Norm of error  9.4073E-11 iterations     1
         -->Test for matrix          168
        ...
         -->Test for matrix          172
        Norm of error  3.8187E-11 iterations     1

        3. Using superlu, I get
        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package superlu
        Norm of error  1.0191E-06 iterations     1
         -->Test for matrix          168
        ...
         -->Test for matrix          172
        Norm of error  9.7858E-07 iterations     1

        Replacing the default DiagPivotThresh of 1. with 0.0, I get the
        same solutions as the other packages:

        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package superlu
        -mat_superlu_diagpivotthresh 0.0

        Norm of error  8.3614E-11 iterations     1
         -->Test for matrix          168
        ...
         -->Test for matrix          172
        Norm of error  3.7098E-11 iterations     1
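
        To make the effect of this option concrete, here is a minimal
        sketch of threshold pivoting for a single column. It assumes the
        usual rule that the diagonal entry is accepted as the pivot when
        its magnitude is at least the threshold times the largest
        magnitude in the column. The function `choose_pivot_row` and its
        dense column argument are hypothetical and are not SuperLU's
        implementation; the sketch only shows why a threshold of 1.0
        behaves like partial pivoting while 0.0 keeps the diagonal
        whenever it is nonzero.

```python
def choose_pivot_row(column, diag_index, thresh):
    """Return the row index to use as the pivot for one column.

    column     -- entries of the column (hypothetical dense representation);
                  rows diag_index and below are eligible
    diag_index -- index of the diagonal entry within `column`
    thresh     -- value in [0, 1], playing the role of DiagPivotThresh
    """
    eligible = range(diag_index, len(column))
    largest_row = max(eligible, key=lambda i: abs(column[i]))
    diag = abs(column[diag_index])
    # Keep the diagonal if it is nonzero and large enough relative to the
    # largest eligible entry; otherwise fall back to the largest entry.
    if diag > 0.0 and diag >= thresh * abs(column[largest_row]):
        return diag_index
    return largest_row

col = [0.5, -2.0, 1.0]
print(choose_pivot_row(col, 0, 1.0))  # threshold 1.0: row of the largest entry
print(choose_pivot_row(col, 0, 0.0))  # threshold 0.0: the diagonal is kept
```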

        4. Using '-mat_view ascii::ascii_info', I found that
        a_flow_check_1.bin and a_flow_check_168.bin seem to have the
        same structure:

         -->loac matrix a
        Mat Object: 1 MPI processes
        type: seqaij
        rows=90000, cols=90000
        total: nonzeros=895600, allocated nonzeros=895600
        total number of mallocs used during MatSetValues calls =0
          using I-node routines: found 45000 nodes, limit used is 5

        5. Using a_flow_check_1.bin, I am able to reproduce the error
        you reported: all packages give correct results except
        superlu_dist:
        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package superlu_dist
        Norm of error  2.5970E-12 iterations     1
         -->Test for matrix  168
        Norm of error  1.3936E-01 iterations    34
         -->Test for matrix  169

        I guess the error might come from reuse of the matrix factor.
        Replacing the default
        -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
        -mat_superlu_dist_fact SamePattern, I get

        ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
        matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
        flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
        -pc_factor_mat_solver_package superlu_dist
        -mat_superlu_dist_fact SamePattern

        Norm of error  2.5970E-12 iterations     1
         -->Test for matrix  168
        Norm of error  9.4073E-11 iterations     1
         -->Test for matrix  169
        Norm of error  6.4303E-11 iterations     1
         -->Test for matrix  170
        Norm of error  7.4327E-11 iterations     1
         -->Test for matrix  171
        Norm of error  5.4162E-11 iterations     1
         -->Test for matrix  172
        Norm of error  3.4440E-11 iterations     1
         --> End of test, bye

        Sherry may be able to tell you why SamePattern_SameRowPerm
        causes the difference here.
        Based on the above experiments, I would set the following as
        defaults:
        '-mat_superlu_diagpivotthresh 0.0' in the petsc/superlu interface.
        '-mat_superlu_dist_fact SamePattern' in the petsc/superlu_dist
        interface.
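
        The experiments above do not establish why reusing the row
        permutation hurts. As one possible mechanism (an assumption, not
        a diagnosis), a row order chosen for the first matrix can leave
        a tiny pivot on the diagonal of a later matrix with the same
        sparsity pattern. The 2x2 sketch below uses made-up matrices and
        a hypothetical helper `solve_with_fixed_rows`; it is not
        SuperLU_DIST code.

```python
def solve_with_fixed_rows(a, b, row_order):
    """Solve a 2x2 system by LU with a prescribed row order and no further pivoting."""
    r0, r1 = row_order
    a00, a01 = a[r0]
    a10, a11 = a[r1]
    l10 = a10 / a00              # multiplier; huge if a00 is a tiny pivot
    u11 = a11 - l10 * a01
    y0 = b[r0]
    y1 = b[r1] - l10 * y0        # forward substitution
    x1 = y1 / u11                # back substitution
    x0 = (y0 - a01 * x1) / a00
    return [x0, x1]

# Made-up "later" matrix; its exact solution is very close to [1, 1].
a2 = [[1e-20, 1.0],
      [1.0,   1.0]]
b2 = [1.0, 2.0]

# Row order that would suit an earlier matrix with a large (0,0) entry.
reused = solve_with_fixed_rows(a2, b2, (0, 1))
# Row order recomputed for a2 itself (largest entry of column 0 first).
fresh = solve_with_fixed_rows(a2, b2, (1, 0))

print("reused row order:", reused)
print("fresh row order: ", fresh)
```

        In this toy case the reused row order loses the first solution
        component entirely, while the recomputed order recovers it.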

        Hong

            Hi Hong,

            I did more tests today and finally found that the
            solution accuracy depends on the quality of the initial
            (first) matrix. I modified ex52f.F to do the test. There
            are 6 matrices and right-hand-side vectors. All these
            matrices and rhs are from my reactive transport
            simulation. Results will be quite different depending on
            which one you use to do the factorization. Results will also
            be different if you run with different options. My code
            is similar to the First or the Second test below. When
            the matrix is well conditioned, it works fine. But if
            the initial matrix is well conditioned, it is likely to
            crash when the matrix becomes ill-conditioned. Since most
            of my cases are well conditioned, I didn't detect the
            problem before. This case is a special one.


            How can I avoid this problem? Should I redo the
            factorization? Can PETSc automatically detect this
            problem, or is there any option available to do this?

            All the data and test code (modified ex52f) can be found
            via the dropbox link below.

            https://www.dropbox.com/s/4al1a60creogd8m/petsc-superlu-test.tar.gz?dl=0


            A summary of my tests is shown below.

            First, use matrix 1 to set up the KSP solver and
            factorization, then solve 168 to 172

            mpiexec.hydra -n 1 ./ex52f -f0
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
            -rhs
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
            -loop_matrices flow_check -loop_folder
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
            -pc_type lu -pc_factor_mat_solver_package superlu_dist

            Norm of error 3.8815E-11 iterations     1
             -->Test for matrix          168
            Norm of error 4.2307E-01 iterations 32
             -->Test for matrix          169
            Norm of error 3.0528E-01 iterations 32
             -->Test for matrix          170
            Norm of error 3.1177E-01 iterations 32
             -->Test for matrix          171
            Norm of error 3.2793E-01 iterations 32
             -->Test for matrix          172
            Norm of error 3.1251E-01 iterations 31

            Second, use matrix 1 to set up the KSP solver and
            factorization using the implemented SuperLU-related
            code. I thought this would generate the same results as
            the First test, but it actually does not.

            mpiexec.hydra -n 1 ./ex52f -f0
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
            -rhs
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
            -loop_matrices flow_check -loop_folder
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
            -superlu_default

            Norm of error 2.2632E-12 iterations     1
             -->Test for matrix          168
            Norm of error 1.0817E+04 iterations     1
             -->Test for matrix          169
            Norm of error 1.0786E+04 iterations     1
             -->Test for matrix          170
            Norm of error 1.0792E+04 iterations     1
             -->Test for matrix          171
            Norm of error 1.0792E+04 iterations     1
             -->Test for matrix          172
            Norm of error 1.0792E+04 iterations     1


            Third, use matrix 168 to set up the KSP solver and
            factorization, then solve 168 to 172

            mpiexec.hydra -n 1 ./ex52f -f0
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
            -rhs
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
            -loop_matrices flow_check -loop_folder
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
            -pc_type lu -pc_factor_mat_solver_package superlu_dist

            Norm of error 9.5528E-10 iterations     1
             -->Test for matrix          168
            Norm of error 9.4945E-10 iterations     1
             -->Test for matrix          169
            Norm of error 6.4279E-10 iterations     1
             -->Test for matrix          170
            Norm of error 7.4633E-10 iterations     1
             -->Test for matrix          171
            Norm of error 7.4863E-10 iterations     1
             -->Test for matrix          172
            Norm of error 8.9701E-10 iterations     1

            Fourth, use matrix 168 to set up the KSP solver and
            factorization using the implemented SuperLU-related
            code. I thought this would generate the same results as
            the Third test, but it actually does not.

            mpiexec.hydra -n 1 ./ex52f -f0
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
            -rhs
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
            -loop_matrices flow_check -loop_folder
            /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
            -superlu_default

            Norm of error 3.7017E-11 iterations     1
             -->Test for matrix          168
            Norm of error 3.6420E-11 iterations     1
             -->Test for matrix          169
            Norm of error 3.7184E-11 iterations     1
             -->Test for matrix          170
            Norm of error 3.6847E-11 iterations     1
             -->Test for matrix          171
            Norm of error 3.7883E-11 iterations     1
             -->Test for matrix          172
            Norm of error 3.8805E-11 iterations     1

            Thanks very much,

            Danyang

            On 15-12-03 01:59 PM, Hong wrote:
            Danyang :
            Further testing a_flow_check_168.bin,
            ./ex10 -f0
            /Users/Hong/Downloads/matrix_and_rhs_bin/a_flow_check_168.bin
            -rhs
            /Users/Hong/Downloads/matrix_and_rhs_bin/x_flow_check_168.bin
            -pc_type lu -pc_factor_mat_solver_package superlu
            -ksp_monitor_true_residual -mat_superlu_conditionnumber
            Recip. condition number = 1.610480e-12
            0 KSP preconditioned resid norm 6.873340313547e+09 true
            resid norm 7.295020990196e+03 ||r(i)||/||b||
            1.000000000000e+00
            1 KSP preconditioned resid norm 2.051833296449e-02 true
            resid norm 2.976859070118e-02 ||r(i)||/||b||
            4.080672384793e-06
            Number of iterations =   1
            Residual norm 0.0297686

            condition number of this matrix = 1/1.610480e-12 = 1.e+12,
            i.e., this matrix is ill-conditioned.
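
            The last value on each monitor line is the true residual
            norm divided by the norm of the right-hand side. A quick
            arithmetic check with the values printed above, assuming
            the iteration-0 true residual norm equals ||b|| (as the
            ratio of 1.0 on that line suggests):

```python
# Values copied from the -ksp_monitor_true_residual output above.
true_resid_0 = 7.295020990196e+03   # iteration 0; the ratio 1.0 implies this is ||b||
true_resid_1 = 2.976859070118e-02   # iteration 1

# Relative residual ||r(1)|| / ||b||
relative_residual = true_resid_1 / true_resid_0

print(f"||r(1)||/||b|| = {relative_residual:.6e}")
```

            The result agrees with the 4.080672384793e-06 reported for
            iteration 1.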

            Hong


                Hi Hong,

                The matrix, rhs and solution in binary format can
                be downloaded via the link below.

                https://www.dropbox.com/s/cl3gfi0s0kjlktf/matrix_and_rhs_bin.tar.gz?dl=0

                Thanks,

                Danyang


                On 15-12-03 10:50 AM, Hong wrote:
                Danyang:



                    To my surprise, the solutions from SuperLU at
                    timestep 29 seem incorrect for the first 4
                    Newton iterations, but the solutions from the
                    iterative solver and MUMPS are correct.

                    Please find all the matrices, rhs and
                    solutions at timestep 29 via the link below.
                    The data is a bit large, so I am sharing
                    it through Dropbox. A piece of matlab code to
                    read these data and then compute the norm has
                    also been attached.

                    https://www.dropbox.com/s/rr8ueysgflmxs7h/results-check.tar.gz?dl=0


                Can you send us the matrix in petsc binary format?

                e.g., call MatView(M,
                PETSC_VIEWER_BINARY_(PETSC_COMM_WORLD))
                or '-ksp_view_mat binary'

                Hong



                    Below is a summary of the norm from the three
                    solvers at timestep 29, newton iteration 1 to 5.

                    Timestep 29
                    Norm of residual seq 1.661321e-09, superlu
                    1.657103e+04, mumps 3.731225e-11
                    Norm of residual seq 1.753079e-09, superlu
                    6.675467e+02, mumps 1.509919e-13
                    Norm of residual seq 4.914971e-10, superlu
                    1.236362e-01, mumps 2.139303e-17
                    Norm of residual seq 3.532769e-10, superlu
                    1.304670e-04, mumps 5.387000e-20
                    Norm of residual seq 3.885629e-10, superlu
                    2.754876e-07, mumps 4.108675e-21

                    Would anybody please check whether SuperLU can
                    solve these matrices? Another possibility is
                    that something is wrong in my own code. But so
                    far, I cannot find any problem in my code,
                    since the same code works fine if I use the
                    iterative solver or the direct solver MUMPS.
                    For the other cases I have tested, all these
                    solvers work fine.

                    Please let me know if I did not write down the
                    problem clearly.

                    Thanks,

                    Danyang












