Hi Hong,

Sorry to bother you again. The modified code works much better than before with both superlu and mumps. However, it still fails in some cases. The case is similar to the previous one: ill-conditioned matrices.

The code crashes after a long simulation if I use superlu_dist, but does not fail if I use superlu. I restarted the simulation shortly before the time it crashes and can reproduce the following error:

timestep: 22 time: 1.750E+04 years delt: 2.500E+00 years iter: 1 max.sia: 5.053E-03 tol.sia: 1.000E-01
 Newton Iteration Convergence Summary:
 Newton       maximum      maximum     solver
 iteration    updatePa     updateTemp   residual     iterations   maxvolpa   maxvoltemp   nexvolpa   nexvoltemp
 1            0.1531E+08   0.1755E+04   0.6920E-05   1            5585       4402         5814       5814

*** Error in `../program_test': malloc(): memory corruption: 0x0000000003a70d50 ***
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:

The solver failed at timestep 22, Newton iteration 2. I exported the matrices at timestep 1 (matrix 1) and timestep 22 (matrices 140 and 141). Matrix 141 is where it failed. Judging from the estimated values, the three matrices here are not ill-conditioned.

I did the same using the newly modified ex52f code and found quite different results for matrix 141. The norm from superlu is much more acceptable than the one from superlu_dist. In this test, memory corruption was not detected. The code and example data can be downloaded from the link below.

https://www.dropbox.com/s/i1ls0bg0vt7gu0v/petsc-superlu-test2.tar.gz?dl=0


****************More tests on matrix_and_rhs_bin2*******************
mpiexec.hydra -n 1 ./ex52f -f0 ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140 -matrix_index_end 141 -pc_type lu -pc_factor_mat_solver_package superlu -ksp_monitor_true_residual -mat_superlu_conditionnumber
 -->loac matrix a
 -->load rhs b
 size l,m,n,mm       90000       90000       90000       90000
  Recip. condition number = 6.000846e-16
  0 KSP preconditioned resid norm 1.146871454377e+08 true resid norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
  1 KSP preconditioned resid norm 2.071118508260e-06 true resid norm 3.363767171515e-08 ||r(i)||/||b|| 7.140102249181e-12
Norm of error  3.3638E-08 iterations     1
 -->Test for matrix          140
  Recip. condition number = 2.256434e-27
  0 KSP preconditioned resid norm 2.084372893355e+14 true resid norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
  1 KSP preconditioned resid norm 4.689629276419e+00 true resid norm 1.037236635337e-01 ||r(i)||/||b|| 2.201690918330e-05
Norm of error  1.0372E-01 iterations     1
 -->Test for matrix          141
  Recip. condition number = 1.256452e-18
  0 KSP preconditioned resid norm 1.055488964519e+08 true resid norm 4.711091037809e+03 ||r(i)||/||b|| 1.000000000000e+00
  1 KSP preconditioned resid norm 2.998827511681e-04 true resid norm 4.805214542776e-04 ||r(i)||/||b|| 1.019979130994e-07
Norm of error  4.8052E-04 iterations     1
 --> End of test, bye
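
As a quick sanity check on the superlu output above, the last column of each monitor line is the true residual norm divided by ||b||. Here is a minimal Python sketch of that arithmetic, assuming ||b|| equals the iteration-0 true residual norm (the monitor reports a ratio of 1.0 there). The variable and function names are illustrative only and are not part of ex52f or PETSc.

```python
# Values copied from the -ksp_monitor_true_residual lines of the superlu run.
# Assumption: ||b|| equals the iteration-0 true residual norm, because the
# monitor prints ||r(0)||/||b|| = 1.0 for every matrix.
B_NORM = 4.711091037809e+03

# True residual norm after the single solve, per matrix.
true_resid_after_solve = {
    "matrix 1": 3.363767171515e-08,
    "matrix 140": 1.037236635337e-01,
    "matrix 141": 4.805214542776e-04,
}


def relative_residual(resid_norm, b_norm):
    """Return ||r|| / ||b||, the quantity shown in the last monitor column."""
    return resid_norm / b_norm


rel = {
    name: relative_residual(resid, B_NORM)
    for name, resid in true_resid_after_solve.items()
}

for name, value in rel.items():
    print(f"{name}: ||r||/||b|| = {value:.6e}")
```

The computed ratios can be compared against the ||r(i)||/||b|| column to confirm that the "Norm of error" printed by the test program is the unscaled true residual norm.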


mpiexec.hydra -n 1 ./ex52f -f0 ./matrix_and_rhs_bin2/a_flow_check_1.bin -rhs ./matrix_and_rhs_bin2/b_flow_check_1.bin -loop_matrices flow_check -loop_folder ./matrix_and_rhs_bin2 -matrix_index_start 140 -matrix_index_end 141 -pc_type lu -pc_factor_mat_solver_package superlu_dist
 -->loac matrix a
 -->load rhs b
 size l,m,n,mm       90000       90000       90000       90000
Norm of error  3.6752E-08 iterations     1
 -->Test for matrix          140
Norm of error  1.6335E-01 iterations     1
 -->Test for matrix          141
Norm of error  3.4345E+01 iterations     1
 --> End of test, bye

Thanks,

Danyang

On 15-12-07 12:01 PM, Hong wrote:
Danyang:
Add 'call MatSetFromOptions(A,ierr)' to your code.
Attached below is ex52f.F modified from your ex52f.F to be compatible with petsc-dev.

Hong

    Hello Hong,

    Thanks for the quick reply. The option
    "-mat_superlu_dist_fact SamePattern" works like a charm when I
    use it from the command line.

    How can I make this option the default? I tried calling
    PetscOptionsInsertString("-mat_superlu_dist_fact SamePattern",ierr)
    in my code, but this does not work.

    Thanks,

    Danyang


    On 15-12-07 10:42 AM, Hong wrote:
    Danyang :

    Adding '-mat_superlu_dist_fact SamePattern' fixed the problem.
    Below is how I figured it out.

    1. Reading ex52f.F, I see '-superlu_default' =
    '-pc_factor_mat_solver_package superlu_dist'; the latter enables
    runtime options for other packages. I use superlu_dist-4.2 and
    superlu-4.1 for the tests below.

    2. Using Matrix 168 to set up the KSP solver and factorization, all
    packages (petsc, superlu_dist and mumps) give the same correct results:

    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package petsc
     -->loac matrix a
     -->load rhs b
     size l,m,n,mm   90000       90000       90000 90000
    Norm of error  7.7308E-11 iterations     1
     -->Test for matrix          168
    ..
     -->Test for matrix          172
    Norm of error  3.8461E-11 iterations     1

    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package superlu_dist
    Norm of error  9.4073E-11 iterations     1
     -->Test for matrix          168
    ...
     -->Test for matrix          172
    Norm of error  3.8187E-11 iterations     1

    3. Using superlu, I get
    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package superlu
    Norm of error  1.0191E-06 iterations     1
     -->Test for matrix          168
    ...
     -->Test for matrix          172
    Norm of error  9.7858E-07 iterations     1

    Replacing the default DiagPivotThresh of 1. with 0.0, I get the
    same solutions as the other packages:

    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_168.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package superlu
    -mat_superlu_diagpivotthresh 0.0

    Norm of error  8.3614E-11 iterations     1
     -->Test for matrix          168
    ...
     -->Test for matrix          172
    Norm of error  3.7098E-11 iterations     1

    4. Using '-mat_view ascii::ascii_info', I found that
    a_flow_check_1.bin and a_flow_check_168.bin seem to have the same
    structure:

     -->loac matrix a
    Mat Object: 1 MPI processes
      type: seqaij
      rows=90000, cols=90000
      total: nonzeros=895600, allocated nonzeros=895600
      total number of mallocs used during MatSetValues calls =0
        using I-node routines: found 45000 nodes, limit used is 5

    5. Using a_flow_check_1.bin, I am able to reproduce the error you
    reported: all packages give correct results except superlu_dist:
    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package superlu_dist
    Norm of error  2.5970E-12 iterations     1
     -->Test for matrix          168
    Norm of error  1.3936E-01 iterations    34
     -->Test for matrix          169

    I guess the error might come from reuse of the matrix factor.
    Replacing the default
    -mat_superlu_dist_fact <SamePattern_SameRowPerm> with
    -mat_superlu_dist_fact SamePattern, I get

    ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
    matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
    -loop_folder matrix_and_rhs_bin -pc_type lu
    -pc_factor_mat_solver_package superlu_dist -mat_superlu_dist_fact
    SamePattern

    Norm of error  2.5970E-12 iterations     1
     -->Test for matrix          168
    Norm of error  9.4073E-11 iterations     1
     -->Test for matrix          169
    Norm of error  6.4303E-11 iterations     1
     -->Test for matrix          170
    Norm of error  7.4327E-11 iterations     1
     -->Test for matrix          171
    Norm of error  5.4162E-11 iterations     1
     -->Test for matrix          172
    Norm of error  3.4440E-11 iterations     1
     --> End of test, bye

    Sherry may tell you why SamePattern_SameRowPerm causes the
    difference here.
    Based on the above experiments, I would set the following as defaults:
    '-mat_superlu_diagpivotthresh 0.0' in the petsc/superlu interface.
    '-mat_superlu_dist_fact SamePattern' in the petsc/superlu_dist interface.

    Hong

        Hi Hong,

        I did more tests today and finally found that the solution
        accuracy depends on the quality of the initial (first) matrix.
        I modified ex52f.F to do the test. There are 6 matrices and
        right-hand-side vectors, all from my reactive transport
        simulation. Results are quite different depending on which
        matrix is used for the factorization, and they also differ
        if you run with different options. My code is similar to the
        First or the Second test below. When the matrix is well
        conditioned, it works fine. But if the initial matrix is well
        conditioned, it is likely to crash when the matrix becomes
        ill-conditioned. Since most of my cases are well conditioned,
        I didn't detect the problem before. This case is a special one.


        How can I avoid this problem? Shall I redo the factorization?
        Can PETSc automatically detect this problem, or is there any
        option available to do this?
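
One generic way to detect that a reused factorization has gone stale is to check the true relative residual ||b - A x||/||b|| after each solve and refactor when it exceeds a tolerance. The Python sketch below only illustrates the idea on a tiny dense system. It is not PETSc code: solve_dense, solve_with_check and the tolerance 1e-8 are hypothetical stand-ins, and solving with the old matrix merely imitates reusing an outdated factor.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def norm2(v):
    """Euclidean norm of a list of floats."""
    return sum(t * t for t in v) ** 0.5


def relative_residual(a, x, b):
    """Return ||b - a x|| / ||b||."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]
    return norm2(r) / norm2(b)


def solve_with_check(a_factored, a_current, b, tol=1e-8):
    """Solve with the old matrix first; redo the solve if the residual is poor."""
    x = solve_dense(a_factored, b)      # stand-in for reusing an old factor
    refactored = False
    if relative_residual(a_current, x, b) > tol:
        x = solve_dense(a_current, b)   # stand-in for a fresh factorization
        refactored = True
    return x, refactored


a_old = [[4.0, 1.0], [1.0, 3.0]]   # matrix the "factor" was built from
a_new = [[4.0, 1.0], [1.0, 0.5]]   # matrix at the current step
b = [1.0, 2.0]

x, refactored = solve_with_check(a_old, a_new, b)
print("refactored:", refactored)
print("relative residual:", relative_residual(a_new, x, b))
```

In a real run the residual check costs one matrix-vector product per solve, which is usually small compared with a factorization.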

        All the data and test code (modified ex52f) can be found via
        the dropbox link below.
        https://www.dropbox.com/s/4al1a60creogd8m/petsc-superlu-test.tar.gz?dl=0


        Summary of my test is shown below.

        First, use Matrix 1 to set up the KSP solver and
        factorization, then solve 168 to 172

        mpiexec.hydra -n 1 ./ex52f -f0
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
        -rhs
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
        -loop_matrices flow_check -loop_folder
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type
        lu -pc_factor_mat_solver_package superlu_dist

        Norm of error  3.8815E-11 iterations 1
         -->Test for matrix          168
        Norm of error  4.2307E-01 iterations 32
         -->Test for matrix          169
        Norm of error  3.0528E-01 iterations 32
         -->Test for matrix          170
        Norm of error  3.1177E-01 iterations 32
         -->Test for matrix          171
        Norm of error  3.2793E-01 iterations 32
         -->Test for matrix          172
        Norm of error  3.1251E-01 iterations 31

        Second, use Matrix 1 to set up the KSP solver and
        factorization using the implemented SuperLU-related code. I
        thought this would generate the same results as the First
        test, but it actually does not.

        mpiexec.hydra -n 1 ./ex52f -f0
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_1.bin
        -rhs
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_1.bin
        -loop_matrices flow_check -loop_folder
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
        -superlu_default

        Norm of error  2.2632E-12 iterations 1
         -->Test for matrix          168
        Norm of error  1.0817E+04 iterations 1
         -->Test for matrix          169
        Norm of error  1.0786E+04 iterations 1
         -->Test for matrix          170
        Norm of error  1.0792E+04 iterations 1
         -->Test for matrix          171
        Norm of error  1.0792E+04 iterations 1
         -->Test for matrix          172
        Norm of error  1.0792E+04 iterations 1


        Third, use Matrix 168 to set up the KSP solver and
        factorization, then solve 168 to 172

        mpiexec.hydra -n 1 ./ex52f -f0
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
        -rhs
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
        -loop_matrices flow_check -loop_folder
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin -pc_type
        lu -pc_factor_mat_solver_package superlu_dist

        Norm of error  9.5528E-10 iterations 1
         -->Test for matrix          168
        Norm of error  9.4945E-10 iterations 1
         -->Test for matrix          169
        Norm of error  6.4279E-10 iterations 1
         -->Test for matrix          170
        Norm of error  7.4633E-10 iterations 1
         -->Test for matrix          171
        Norm of error  7.4863E-10 iterations 1
         -->Test for matrix          172
        Norm of error  8.9701E-10 iterations 1

        Fourth, use Matrix 168 to set up the KSP solver and
        factorization using the implemented SuperLU-related code. I
        thought this would generate the same results as the Third
        test, but it actually does not.

        mpiexec.hydra -n 1 ./ex52f -f0
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/a_flow_check_168.bin
        -rhs
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin/b_flow_check_168.bin
        -loop_matrices flow_check -loop_folder
        /home/dsu/work/petsc-superlu-test/matrix_and_rhs_bin
        -superlu_default

        Norm of error  3.7017E-11 iterations 1
         -->Test for matrix          168
        Norm of error  3.6420E-11 iterations 1
         -->Test for matrix          169
        Norm of error  3.7184E-11 iterations 1
         -->Test for matrix          170
        Norm of error  3.6847E-11 iterations 1
         -->Test for matrix          171
        Norm of error  3.7883E-11 iterations 1
         -->Test for matrix          172
        Norm of error  3.8805E-11 iterations 1

        Thanks very much,

        Danyang

        On 15-12-03 01:59 PM, Hong wrote:
        Danyang :
        Further testing a_flow_check_168.bin,
        ./ex10 -f0
        /Users/Hong/Downloads/matrix_and_rhs_bin/a_flow_check_168.bin -rhs
        /Users/Hong/Downloads/matrix_and_rhs_bin/x_flow_check_168.bin -pc_type
        lu -pc_factor_mat_solver_package superlu
        -ksp_monitor_true_residual -mat_superlu_conditionnumber
        Recip. condition number = 1.610480e-12
          0 KSP preconditioned resid norm 6.873340313547e+09 true
        resid norm 7.295020990196e+03 ||r(i)||/||b|| 1.000000000000e+00
          1 KSP preconditioned resid norm 2.051833296449e-02 true
        resid norm 2.976859070118e-02 ||r(i)||/||b|| 4.080672384793e-06
        Number of iterations =   1
        Residual norm 0.0297686

        condition number of this matrix = 1/1.610480e-12, about
        6.2e+11 (on the order of 1.e+12), i.e., this matrix is
        ill-conditioned.
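
The same conversion can be applied to any value printed by -mat_superlu_conditionnumber. Here is a minimal Python sketch, assuming the usual rule of thumb that about log10(cond) of the roughly 16 decimal digits of double precision may be lost in the solution. The function name is illustrative only.

```python
import math


def condition_from_rcond(rcond):
    """Estimated condition number from a reciprocal condition number."""
    return 1.0 / rcond


# Reciprocal condition number reported for a_flow_check_168.bin.
rcond = 1.610480e-12
cond = condition_from_rcond(rcond)

# Rule of thumb (an assumption, not a bound): roughly log10(cond)
# decimal digits of accuracy are at risk in the computed solution.
digits_at_risk = math.log10(cond)

print(f"cond ~ {cond:.3e}, about {digits_at_risk:.1f} digits at risk")
```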

        Hong


            Hi Hong,

            The matrix, rhs and solution in binary format can be
            downloaded via the link below.

            https://www.dropbox.com/s/cl3gfi0s0kjlktf/matrix_and_rhs_bin.tar.gz?dl=0

            Thanks,

            Danyang


            On 15-12-03 10:50 AM, Hong wrote:
            Danyang:



                To my surprise, the solutions from SuperLU at
                timestep 29 seem incorrect for the first 4
                Newton iterations, but the solutions from the
                iterative solver and MUMPS are correct.

                Please find all the matrices, rhs and solutions at
                timestep 29 via the link below. The data is a bit
                large, so I am sharing it through Dropbox. A
                piece of matlab code that reads these data and
                computes the norm is also attached.
                https://www.dropbox.com/s/rr8ueysgflmxs7h/results-check.tar.gz?dl=0


            Can you send us the matrix in petsc binary format?

            e.g., call MatView(M,
            PETSC_VIEWER_BINARY_(PETSC_COMM_WORLD))
            or '-ksp_view_mat binary'

            Hong



                Below is a summary of the norm from the three
                solvers at timestep 29, Newton iterations 1 to 5.

                Timestep 29
                Norm of residual seq 1.661321e-09, superlu 1.657103e+04, mumps 3.731225e-11
                Norm of residual seq 1.753079e-09, superlu 6.675467e+02, mumps 1.509919e-13
                Norm of residual seq 4.914971e-10, superlu 1.236362e-01, mumps 2.139303e-17
                Norm of residual seq 3.532769e-10, superlu 1.304670e-04, mumps 5.387000e-20
                Norm of residual seq 3.885629e-10, superlu 2.754876e-07, mumps 4.108675e-21

                Would anybody please check if SuperLU can solve
                these matrices? Another possibility is that
                something is wrong in my own code. But so far, I
                cannot find any problem in my code, since the same
                code works fine if I use the iterative solver or
                the direct solver MUMPS. For the other cases I have
                tested, all these solvers work fine.

                Please let me know if I did not write down the
                problem clearly.

                Thanks,

                Danyang










