Simone,

     This is because we are trying to send messages too long for MPI to handle. 
This is a problem for MPI for two reasons:

1) MPI "count" arguments are always int, when we use 64 bit PetscInt (because 
of the --with-64-bit-indices PetscInt becomes long long int) this means we 
"may" be passing values too large as count values to MPI and because C/C++ 
automatically castes long long int arguments to int it ends up passing garbage 
values to the MPI libraries.  Now I say "may" because this is only a problem if 
a count happens to be so large it won't fit in an int.

2) Even if the "count" values passed to MPI are correct int values, we've found 
that none of the MPI implementations handle "counts" correctly when they are 
within a factor of 4 or 8 of the largest value allowed in an int. This is 
because the MPI implementations improperly do things like convert from a count 
to a byte size by multiplying by sizeof(the type being passed) and storing the 
result in an int (where it won't fit). We've harassed the MPICH folks about 
this but they consider it a low priority to fix. (A small standalone sketch of 
both failure modes follows this list.)

  In a few places where PETSc makes MPI calls we have started to be very 
careful: we only use PetscMPIInt as count arguments to MPI calls, explicitly 
check that we can cast from PetscInt to PetscMPIInt, and generate an error if 
the result won't fit. We also replace a single call to MPI_Send() or 
MPI_Recv() with our own routines MPILong_Send() and MPILong_Recv(), which make 
several calls to MPI_Send() and MPI_Recv(), each small enough for MPI to 
handle (a simplified sketch of this chunking follows below). For example, in 
MatView_MPIAIJ_Binary() we've updated the code to handle absurdly large 
matrices that cannot use the MPI calls directly.
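
   For illustration, here is a simplified sketch of the chunking idea behind 
MPILong_Send() (assuming double data; the helper name LongSendDouble() and the 
chunk size are mine, and the actual petsc-dev routines differ in details such 
as error handling and datatype generality):

    #include <mpi.h>

    /* 64M doubles per message: keeps both the count and count*sizeof(double)
       comfortably inside an int. */
    #define CHUNK_COUNT (64*1024*1024)

    static int LongSendDouble(const double *buf, long long count,
                              int dest, int tag, MPI_Comm comm)
    {
      long long offset = 0;
      while (offset < count) {
        long long remaining = count - offset;
        int       n    = (remaining > CHUNK_COUNT) ? CHUNK_COUNT : (int)remaining;
        int       ierr = MPI_Send(buf + offset, n, MPI_DOUBLE, dest, tag, comm);
        if (ierr != MPI_SUCCESS) return ierr;
        offset += n;
      }
      return MPI_SUCCESS;
    }

The receiving side has to loop the same way, posting one MPI_Recv() per chunk, 
which is what the matching MPILong_Recv() does.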

  I will update the viewer and loader for MPIDense matrices to work correctly, 
but you will have to test it in petsc-dev (not petsc-3.1). Also, I have no 
machines with enough memory to do proper testing, so you will need to test the 
code for me.


   Barry




On Jun 3, 2011, at 9:31 AM, Simone Re wrote:

> Dear Experts,
>                I'm facing an issue when saving an MPI dense matrix.
> 
> My matrix has:
> 
> -          5085 rows
> 
> -          737352 columns
> and the crash occurs when I run the program using 12 CPUs (for instance with 
> 16 CPUs everything is fine).
> 
> I built my program using both mvapich2 and Intel MPI 4 and it crashes in both 
> cases.
> 
> When I run my original program built against Intel MPI 4 I get the following.
> 
> [4]PETSC ERROR: MatView_MPIDense_Binary() line 658 in 
> src/mat/impls/dense/mpi/mpidense.c
> [4]PETSC ERROR: MatView_MPIDense() line 780 in 
> src/mat/impls/dense/mpi/mpidense.c
> [4]PETSC ERROR: MatView() line 717 in src/mat/interface/matrix.c
> [4]PETSC ERROR: 
> ------------------------------------------------------------------------
> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
> probably memory access out of range
> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [4]PETSC ERROR: or see 
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find 
> memory corruption errors
> [4]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> [4]PETSC ERROR: to get more information on the crash.
> [4]PETSC ERROR: --------------------- Error Message 
> ------------------------------------
> [4]PETSC ERROR: Signal received!
> [4]PETSC ERROR: 
> ------------------------------------------------------------------------
> [4]PETSC ERROR: Petsc Release Version 3.1.0, Patch 7, Mon Dec 20 14:26:37 CST 
> 2010
> [4]PETSC ERROR: See docs/changes/index.html for recent updates.
> [4]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [4]PETSC ERROR: See docs/index.html for manual pages.
> [4]PETSC ERROR: 
> ------------------------------------------------------------------------
> ...
> 
> Unfortunately, when I run the sample program attached, I get the crash but I 
> don't get the same error message.
> I've attached also:
> 
> -           the error I get from the sample program (built using mvapich2)
> 
> -          configure.log
> 
> -          the command line I used to invoke the sample program
> 
> Thanks and regards,
>                Simone Re
> 
> Simone Re
> Team Leader
> Integrated EM Center of Excellence
> WesternGeco GeoSolutions
> via Celeste Clericetti 42/A
> 20133 Milano - Italy
> +39 02 . 266 . 279 . 246   (direct)
> +39 02 . 266 . 279 . 279   (fax)
> sre at slb.com
> 
> <for_petsc_team.tar.bz2>

