Thank you, Sherry, for your efforts.

Before I can set up an example that reproduces the problem, I have to ask a PETSc-related question. When I dump a matrix via MatView and read it back with MatLoad, the original partitioning is ignored. Say I originally have 100 and 110 equations on two processors; after MatLoad I will have 105 and 105, also on two processors. What do I do to pass the partitioning info through MatView/MatLoad? I guess it's important for reproducing my setup exactly.

Thanks
On 10/19/2016 08:06 AM, Xiaoye S. Li wrote:
I looked at each valgrind-complained item in your email dated Oct. 11. Those reports are really superficial; I don't see anything wrong with the lines they single out (mostly uninitialized variables). I did a few tests with the latest version on GitHub; all went fine.
Perhaps you can print the matrix that caused the problem, and I can run it with your matrix.
Sherry
On Tue, Oct 11, 2016 at 2:18 PM, Anton <po...@uni-mainz.de> wrote:
On 10/11/16 7:19 PM, Satish Balay wrote:
This log looks truncated. Are there any valgrind messages
before this?
[like from your application code - or from MPI]
Yes, it is indeed truncated. I only included the relevant messages.
Perhaps you can send the complete log - with:
valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
--track-origins=yes
[and if there were more valgrind messages from MPI - rebuild petsc
There are no messages originating from our code, just a few MPI-related ones (probably false positives) and ones from SuperLU_DIST (most of them).
Thanks,
Anton
with --download-mpich - for a valgrind clean mpi]
Sherry,
Perhaps this log points to some issue in superlu_dist?
thanks,
Satish
On Tue, 11 Oct 2016, Anton Popov wrote:
Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674== Use of uninitialised value of size 8
==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674== Use of uninitialised value of size 8
==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
On 10/11/2016 03:26 PM, Anton Popov wrote:
On 10/10/2016 07:11 PM, Satish Balay wrote:
That's from petsc-3.5
Anton - please post the stack trace you get with
--download-superlu_dist-commit=origin/maint
I guess this is it:
[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
According to the line numbers, it crashes within MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx. Surprisingly, this only happens on the second SNES iteration, but not on the first.

I'm trying to reproduce this behavior with the PETSc KSP and SNES examples. However, everything I've tried so far with SuperLU_DIST works just fine.

I'm also running our code under Valgrind to make sure it's clean.
Anton
Satish
On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
Which version of superlu_dist does this capture? I looked at the original error log; it pointed to pdgssvx, line 161. But that line is in a comment block, not in the program.
Sherry
On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <po...@uni-mainz.de> wrote:
On 10/07/2016 05:23 PM, Satish Balay wrote:
On Fri, 7 Oct 2016, Kong, Fande wrote:
On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
On Fri, 7 Oct 2016, Anton Popov wrote:
Hi guys,

are there any news about fixing the buggy behavior of SuperLU_DIST, exactly what is described here:

http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

I'm using 3.7.4 and still get a SEGV in the pdgssvx routine. Everything works fine with 3.5.4.

Do I still have to stick to the maint branch, and what are the chances for these fixes to be included in 3.7.5?
3.7.4 is off the maint branch [as of a week ago]. So if you are seeing issues with it, it's best to debug and figure out the cause.
This bug is indeed inside superlu_dist, and we started having this issue from PETSc-3.6.x. I think the superlu_dist developers should have fixed this bug. Did we forget to update superlu_dist? This is not something users could debug and fix.

I have many people at INL suffering from this issue, and they have to stay with PETSc-3.5.4 to use superlu_dist.
To verify if the bug is fixed in the latest superlu_dist, you can try [assuming you have git - either from petsc-3.7/maint/master]:

--download-superlu_dist --download-superlu_dist-commit=origin/maint
Satish
Hi Satish,

I did this:

git clone -b maint https://bitbucket.org/petsc/petsc.git petsc

--download-superlu_dist --download-superlu_dist-commit=origin/maint (not sure this is needed, since I'm already in maint)

The problem is still there.
Cheers,
Anton