Thank you, Sherry, for your efforts,

but before I can set up an example that reproduces the problem, I have to ask a PETSc-related question.

When I pass a matrix through MatView/MatLoad, its original partitioning is ignored.

Say I originally have 100 and 110 equations on two processors; after MatLoad I will have 105 and 105, also on two processors.

How do I pass the partitioning info through MatView/MatLoad?

I guess it's important for reproducing my setup exactly.
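
For reference, the 105/105 split is what an even default division of 210 rows over two ranks would give. Below is a minimal sketch of that arithmetic in plain C. It assumes that PETSc's default (PETSC_DECIDE) layout gives each rank N/size rows plus one extra row on the first N % size ranks; default_local_rows and default_first_row are hypothetical helpers, not PETSc functions. A likely way to keep a 100/110 layout, stated here as an assumption and not checked against this setup, is to record the local row counts separately and pass them to MatSetSizes before calling MatLoad.

```c
#include <assert.h>

/* Hypothetical helper (not part of PETSc): number of rows owned by `rank`
   when N global rows are split as evenly as possible over `size` ranks.
   Assumption: this mirrors the default PETSC_DECIDE layout. */
int default_local_rows(int N, int size, int rank)
{
    assert(N >= 0 && size > 0 && rank >= 0 && rank < size);
    /* The first (N % size) ranks each get one extra row. */
    return N / size + ((N % size) > rank ? 1 : 0);
}

/* Hypothetical helper: first global row owned by `rank` under that split. */
int default_first_row(int N, int size, int rank)
{
    int r, start = 0;
    for (r = 0; r < rank; r++)
        start += default_local_rows(N, size, r);
    return start;
}
```

With N = 210 and two ranks, default_local_rows returns 105 for both ranks, which matches what MatLoad produced. A 100/110 layout cannot come out of this rule, so it would have to be supplied explicitly.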

Thanks


On 10/19/2016 08:06 AM, Xiaoye S. Li wrote:
I looked at each item valgrind complained about in your email dated Oct. 11. Those reports are really superficial; I don't see anything wrong with the lines singled out (mostly uninitialized variables). I did a few tests with the latest version on GitHub, and all went fine.

Perhaps you can print the matrix that caused the problem, so that I can run with it.

Sherry


On Tue, Oct 11, 2016 at 2:18 PM, Anton <po...@uni-mainz.de> wrote:



    On 10/11/16 7:19 PM, Satish Balay wrote:

        This log looks truncated. Are there any valgrind messages
        before this?
        [like from your application code - or from MPI]

    Yes it is indeed truncated. I only included relevant messages.


        Perhaps you can send the complete log - with:
        valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
        --track-origins=yes

        [and if there were more valgrind messages from MPI - rebuild petsc

    There are no messages originating from our code, just a few MPI
    related ones (probably false positives) and from SuperLU_DIST
    (most of them).

    Thanks,
    Anton

        with --download-mpich - for a valgrind clean mpi]

        Sherry,
        Perhaps this log points to some issue in superlu_dist?

        thanks,
        Satish

        On Tue, 11 Oct 2016, Anton Popov wrote:

            Valgrind immediately detects interesting stuff:

            ==25673== Use of uninitialised value of size 8
            ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
            ==25674== Use of uninitialised value of size 8
            ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
            ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
            ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)


            ==25673== Conditional jump or move depends on uninitialised value(s)
            ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
            ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)


            ==25673== Conditional jump or move depends on uninitialised value(s)
            ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
            ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
            ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
            ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

            ==25674== Use of uninitialised value of size 8
            ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
            ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
            ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
            ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
            ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
            ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
            ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
            ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
            ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
            ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
            ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

            ==25674== Use of uninitialised value of size 8
            ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
            ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

            And it crashes after this:

            ==25674== Invalid write of size 4
            ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
            ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
            ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
            ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
            ==25674==
            [1]PETSC ERROR: ------------------------------------------------------------------------
            [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range


            On 10/11/2016 03:26 PM, Anton Popov wrote:

                On 10/10/2016 07:11 PM, Satish Balay wrote:

                    That's from petsc-3.5

                    Anton - please post the stack trace you get with
                    --download-superlu_dist-commit=origin/maint

                I guess this is it:

                [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
                [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
                [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
                [0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
                [0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c

                According to the line numbers it crashes within
                MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.

                Surprisingly, this only happens on the second SNES
                iteration, but not on the first.

                I'm trying to reproduce this behavior with PETSc KSP
                and SNES examples.
                However, everything I've tried up to now with
                SuperLU_DIST does just fine.

                I'm also checking our code in Valgrind to make sure
                it's clean.

                Anton

                    Satish


                    On Mon, 10 Oct 2016, Xiaoye S. Li wrote:

                        Which version of superlu_dist does this
                        capture? I looked at the original error log; it
                        pointed to pdgssvx: line 161. But that line is
                        in a comment block, not the program.

                        Sherry


                        On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov
                        <po...@uni-mainz.de> wrote:

                            On 10/07/2016 05:23 PM, Satish Balay wrote:

                                On Fri, 7 Oct 2016, Kong, Fande wrote:

                                On Fri, Oct 7, 2016 at 9:04 AM, Satish
                                Balay <ba...@mcs.anl.gov> wrote:

                                    On Fri, 7 Oct 2016, Anton Popov wrote:

                                            Hi guys,

                                            is there any news about
                                            fixing the buggy behavior of
                                            SuperLU_DIST, exactly what
                                            is described here:

                                            http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html
                                            ?

                                            I'm using 3.7.4 and still
                                            get SEGV in pdgssvx routine.
                                            Everything works fine with
                                            3.5.4.

                                            Do I still have to stick
                                            to maint branch, and what
                                            are the chances for these
                                            fixes to be included in 3.7.5?

                                        3.7.4 is off the maint branch [as
                                        of a week ago]. So if you are
                                        seeing issues with it - it's best
                                        to debug and figure out the cause.

                                    This bug is indeed inside of
                                    superlu_dist, and we started having
                                    this issue from PETSc-3.6.x. I think
                                    superlu_dist developers should have
                                    fixed this bug. We forgot to update
                                    superlu_dist?? This is not a thing
                                    users could debug and fix.

                                    I have many people in INL
                                    suffering from this issue, and
                                    they have to stay with PETSc-3.5.4
                                    to use superlu_dist.

                                To verify if the bug is fixed in
                                latest superlu_dist - you can try
                                [assuming you have git - either from
                                petsc-3.7/maint/master]:

                                --download-superlu_dist
                                --download-superlu_dist-commit=origin/maint


                                Satish

                                Hi Satish,

                            I did this:

                            git clone -b maint
                            https://bitbucket.org/petsc/petsc.git petsc

                            --download-superlu_dist
                            --download-superlu_dist-commit=origin/maint
                            (not sure this is needed,
                            since I'm already in maint)

                            The problem is still there.

                            Cheers,
                            Anton





