Sherry,
One minor issue with the tarball: I see the following new files in the
v4.1 tarball (when comparing it with v4.0). Some of these files are
probably junk and could be removed from the tarball:
EXAMPLE/dscatter.c.bak
EXAMPLE/g10.cua
EXAMPLE/g4.cua
EXAMPLE/g4.postorder.eps
EXAMPLE/g4.rua
EXAMPLE/g4_postorder.jpg
EXAMPLE/hostname
EXAMPLE/pdgssvx.c
EXAMPLE/pdgstrf2.c
EXAMPLE/pwd
EXAMPLE/pzgstrf2.c
EXAMPLE/pzgstrf_v3.3.c
EXAMPLE/pzutil.c
EXAMPLE/test.bat
EXAMPLE/test.cpu.bat
EXAMPLE/test.err
EXAMPLE/test.err.1
EXAMPLE/zlook_ahead_update.c
FORTRAN/make.out
FORTRAN/zcreate_dist_matrix.c
MAKE_INC/make.xc30
SRC/int_t
SRC/lnbrow
SRC/make.out
SRC/rnbrow
SRC/temp
SRC/temp1
Thanks,
Satish
On Tue, 28 Jul 2015, Xiaoye S. Li wrote:
I am checking v4.1 now. I'll let you know when I have fixed the problem.
Sherry
On Tue, Jul 28, 2015 at 8:27 AM, Hong <[email protected]> wrote:
Sherry,
I tested with superlu_dist v4.1. The extra printouts are gone, but the
hang remains.
It hangs at:
#5 0x00007fde5af1c818 in PMPI_Wait (request=0xb6e4e0,
status=0x7fff9cd83d60)
at src/mpi/pt2pt/wait.c:168
#6 0x00007fde602dd635 in pzgstrf (options=0x9202f0, m=4900, n=4900,
anorm=13.738475134194639, LUstruct=0x9203c8, grid=0x9202c8,
stat=0x7fff9cd84880, info=0x7fff9cd848bc) at pzgstrf.c:1308
if (recv_req[0] != MPI_REQUEST_NULL) {
--> MPI_Wait (&recv_req[0], &status);
We will update the PETSc interface to superlu_dist v4.1.
Hong
On Mon, Jul 27, 2015 at 11:33 PM, Xiaoye S. Li <[email protected]> wrote:
Hong,
Thanks for trying it out.
The extra printouts are not properly guarded by the print level. I will
fix that. I will look into the hang problem soon.
Sherry
On Mon, Jul 27, 2015 at 7:50 PM, Hong <[email protected]> wrote:
Sherry,
I can reproduce the hang using petsc/src/ksp/ksp/examples/tutorials/ex10.c:
mpiexec -n 4 ./ex10 -f0 /homes/hzhang/tmp/Amat_binary.m -rhs 0 -pc_type lu -pc_factor_mat_solver_package superlu_dist -mat_superlu_dist_parsymbfact
...
.. Starting with 1 OpenMP threads
[0] .. BIG U size 1342464
[0] .. BIG V size 131072
Max row size is 1311
Using buffer_size of 5000000
Threads per process 1
...
Using a debugger (with the PETSc option '-start_in_debugger'), I find
that the hang occurs at:
#0 0x00007f117d870998 in __GI___poll (fds=0x20da750, nfds=4,
timeout=<optimized out>, timeout@entry=-1)
at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007f117de9f7de in MPIDU_Sock_wait (sock_set=0x20da550,
millisecond_timeout=millisecond_timeout@entry=-1,
eventp=eventp@entry=0x7fff654930b0)
at src/mpid/common/sock/poll/sock_wait.i:123
#2 0x00007f117de898b8 in MPIDI_CH3i_Progress_wait (
progress_state=0x7fff65493120)
at src/mpid/ch3/channels/sock/src/ch3_progress.c:218
#3 MPIDI_CH3I_Progress (blocking=blocking@entry=1,
state=state@entry=0x7fff65493120)
at src/mpid/ch3/channels/sock/src/ch3_progress.c:921
#4 0x00007f117de1a559 in MPIR_Wait_impl (request=request@entry
=0x262df90,
status=status@entry=0x7fff65493390) at src/mpi/pt2pt/wait.c:67
#5 0x00007f117de1a818 in PMPI_Wait (request=0x262df90,
status=0x7fff65493390)
at src/mpi/pt2pt/wait.c:168
#6 0x00007f11831da557 in pzgstrf (options=0x23dfda0, m=4900,
n=4900,
anorm=13.738475134194639, LUstruct=0x23dfe78, grid=0x23dfd78,
stat=0x7fff65493ea0, info=0x7fff65493edc) at pzgstrf.c:1308
#7 0x00007f11831bf3bd in pzgssvx (options=0x23dfda0, A=0x23dfe30,
ScalePermstruct=0x23dfe50, B=0x0, ldb=1225, nrhs=0,
grid=0x23dfd78,
LUstruct=0x23dfe78, SOLVEstruct=0x23dfe98, berr=0x0,
stat=0x7fff65493ea0,
info=0x7fff65493edc) at pzgssvx.c:1063
#8 0x00007f11825c2340 in MatLUFactorNumeric_SuperLU_DIST
(F=0x23a0110,
A=0x21bb7e0, info=0x2355068)
at
/sandbox/hzhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:411
#9 0x00007f1181c6c567 in MatLUFactorNumeric (fact=0x23a0110,
mat=0x21bb7e0,
info=0x2355068) at
/sandbox/hzhang/petsc/src/mat/interface/matrix.c:2946
#10 0x00007f1182a56489 in PCSetUp_LU (pc=0x2353a10)
at /sandbox/hzhang/petsc/src/ksp/pc/impls/factor/lu/lu.c:152
#11 0x00007f1182b16f24 in PCSetUp (pc=0x2353a10)
at /sandbox/hzhang/petsc/src/ksp/pc/interface/precon.c:983
#12 0x00007f1182be61b5 in KSPSetUp (ksp=0x232c2a0)
at /sandbox/hzhang/petsc/src/ksp/ksp/interface/itfunc.c:332
#13 0x0000000000405a31 in main (argc=11, args=0x7fff65499578)
at
/sandbox/hzhang/petsc/src/ksp/ksp/examples/tutorials/ex10.c:312
You may want to take a look at it. Sequential symbolic factorization
works fine.
Why does superlu_dist (v4.0) in complex precision display the following?
.. Starting with 1 OpenMP threads
[0] .. BIG U size 1342464
[0] .. BIG V size 131072
Max row size is 1311
Using buffer_size of 5000000
Threads per process 1
...
I realize that I am using superlu_dist v4.0. Would v4.1 work? I'll give
it a try tomorrow.
Hong
On Mon, Jul 27, 2015 at 1:25 PM, Anthony Paul Haas <
[email protected]> wrote:
Hi Hong,
No, that is not the correct matrix. I forgot to mention that it is a
complex matrix. I tried loading the matrix I sent you this morning with:
!...Load a Matrix in Binary Format
call PetscViewerBinaryOpen(PETSC_COMM_WORLD,"Amat_binary.m",FILE_MODE_READ,viewer,ierr)
call MatCreate(PETSC_COMM_WORLD,DLOAD,ierr)
call MatSetType(DLOAD,MATAIJ,ierr)
call MatLoad(DLOAD,viewer,ierr)
call PetscViewerDestroy(viewer,ierr)
call MatView(DLOAD,PETSC_VIEWER_STDOUT_WORLD,ierr)
The first 37 rows should look like this:
Mat Object: 2 MPI processes
type: mpiaij
row 0: (0, 1)
row 1: (1, 1)
row 2: (2, 1)
row 3: (3, 1)
row 4: (4, 1)
row 5: (5, 1)
row 6: (6, 1)
row 7: (7, 1)
row 8: (8, 1)
row 9: (9, 1)
row 10: (10, 1)
row 11: (11, 1)
row 12: (12, 1)
row 13: (13, 1)
row 14: (14, 1)
row 15: (15, 1)
row 16: (16, 1)
row 17: (17, 1)
row 18: (18, 1)
row 19: (19, 1)
row 20: (20, 1)
row 21: (21, 1)
row 22: (22, 1)
row 23: (23, 1)
row 24: (24, 1)
row 25: (25, 1)
row 26: (26, 1)
row 27: (27, 1)
row 28: (28, 1)
row 29: (29, 1)
row 30: (30, 1)
row 31: (31, 1)
row 32: (32, 1)
row 33: (33, 1)
row 34: (34, 1)
row 35: (35, 1)
row 36: (1, -41.2444) (35, -41.2444) (36, 118.049 - 0.999271 i) (37, -21.447) (38, 5.18873)
  (39, -2.34856) (40, 1.3607) (41, -0.898206) (42, 0.642715) (43, -0.48593)
  (44, 0.382471) (45, -0.310476) (46, 0.258302) (47, -0.219268) (48, 0.189304)
  (49, -0.165815) (50, 0.147076) (51, -0.131907) (52, 0.119478) (53, -0.109189)
  (54, 0.1006) (55, -0.0933795) (56, 0.0872779) (57, -0.0821019) (58, 0.0777011)
  (59, -0.0739575) (60, 0.0707775) (61, -0.0680868) (62, 0.0658258) (63, -0.0639473)
  (64, 0.0624137) (65, -0.0611954) (66, 0.0602698) (67, -0.0596202) (68, 0.0592349)
  (69, -0.0295536) (71, -21.447) (106, 5.18873) (141, -2.34856) (176, 1.3607)
  (211, -0.898206) (246, 0.642715) (281, -0.48593) (316, 0.382471) (351, -0.310476)
  (386, 0.258302) (421, -0.219268) (456, 0.189304) (491, -0.165815) (526, 0.147076)
  (561, -0.131907) (596, 0.119478) (631, -0.109189) (666, 0.1006) (701, -0.0933795)
  (736, 0.0872779) (771, -0.0821019) (806, 0.0777011) (841, -0.0739575) (876, 0.0707775)
  (911, -0.0680868) (946, 0.0658258) (981, -0.0639473) (1016, 0.0624137) (1051, -0.0611954)
  (1086, 0.0602698) (1121, -0.0596202) (1156, 0.0592349) (1191, -0.0295536) (1261, 0)
  (3676, 117.211) (3711, -58.4801) (3746, -78.3633) (3781, 29.4911) (3816, -15.8073)
  (3851, 9.94324) (3886, -6.87205) (3921, 5.05774) (3956, -3.89521) (3991, 3.10522)
  (4026, -2.54388) (4061, 2.13082) (4096, -1.8182) (4131, 1.57606) (4166, -1.38491)
  (4201, 1.23155) (4236, -1.10685) (4271, 1.00428) (4306, -0.919116) (4341, 0.847829)
  (4376, -0.787776) (4411, 0.736933) (4446, -0.693735) (4481, 0.656958) (4516, -0.625638)
  (4551, 0.599007) (4586, -0.576454) (4621, 0.557491) (4656, -0.541726) (4691, 0.528849)
  (4726, -0.518617) (4761, 0.51084) (4796, -0.50538) (4831, 0.502142) (4866, -0.250534)
Thanks,
Anthony
On Fri, Jul 24, 2015 at 7:56 PM, Hong <[email protected]> wrote:
Anthony:
I tested your Amat_binary.m using petsc/src/ksp/ksp/examples/tutorials/ex10.c.
Your matrix has many zero rows:
./ex10 -f0 ~/tmp/Amat_binary.m -rhs 0 -mat_view |more
Mat Object: 1 MPI processes
type: seqaij
row 0: (0, 1)
row 1: (1, 0)
row 2: (2, 1)
row 3: (3, 0)
row 4: (4, 1)
row 5: (5, 0)
row 6: (6, 1)
row 7: (7, 0)
row 8: (8, 1)
row 9: (9, 0)
...
row 36: (1, 1) (35, 0) (36, 1) (37, 0) (38, 1) (39, 0) (40, 1) (41, 0) (42, 1) (43, 0)
  (44, 1) (45, 0) (46, 1) (47, 0) (48, 1) (49, 0) (50, 1) (51, 0) (52, 1) (53, 0)
  (54, 1) (55, 0) (56, 1) (57, 0) (58, 1) (59, 0) (60, 1) ...
Did you send us the correct matrix?
I ran my code through valgrind and gdb as suggested by Barry. I am now
coming back to a problem I have had while running with parallel symbolic
factorization. I am attaching a test matrix (PETSc binary format) that I
LU-decompose and then use to solve a linear system (see code below). I
can run on 2 processors with parsymbfact, or with 4 processors without
parsymbfact. However, if I run on 4 procs with parsymbfact, the code just
hangs. Below is the simplified test case that I have used. The matrices A
and B are built somewhere else in my program; the matrix I am attaching
is A-sigma*B (see below).
One other thing: for sparse matrices, I don't know what the optimum
number of processors to use for an LU decomposition is. Does it depend on
the total number of nonzeros? Do you have an easy way to compute it?
You have to experiment with your matrix on a target machine to find out.
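For instance, a rough way to experiment (reusing the ex10 command from
earlier in this thread; adjust the matrix path, and note that
-log_summary and the MatLUFactorNum event are PETSc's profiling names) is
to time the numeric factorization at a few process counts:

for p in 4 8 16 32; do
  mpiexec -n $p ./ex10 -f0 Amat_binary.m -rhs 0 -pc_type lu \
    -pc_factor_mat_solver_package superlu_dist -mat_superlu_dist_parsymbfact \
    -log_summary | grep MatLUFactorNum   # time/flops of the numeric LU factorization
done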
Hong
Subroutine HowBigLUCanBe(rank)

  IMPLICIT NONE

  integer(i4b),intent(in) :: rank
  integer(i4b)            :: i,ct
  real(dp)                :: begin,endd
  complex(dpc)            :: sigma

  PetscErrorCode ierr

  if (rank==0) call cpu_time(begin)

  if (rank==0) then
     write(*,*)
     write(*,*)'Testing How Big LU Can Be...'
     write(*,*)'============================'
     write(*,*)
  endif

  sigma = (1.0d0,0.0d0)
  call MatAXPY(A,-sigma,B,DIFFERENT_NONZERO_PATTERN,ierr) ! on exit A = A-sigma*B

  !.....Write Matrix to ASCII and Binary Format
  !call PetscViewerASCIIOpen(PETSC_COMM_WORLD,"Amat.m",viewer,ierr)
  !call MatView(DXX,viewer,ierr)
  !call PetscViewerDestroy(viewer,ierr)

  call PetscViewerBinaryOpen(PETSC_COMM_WORLD,"Amat_binary.m",FILE_MODE_WRITE,viewer,ierr)
  call MatView(A,viewer,ierr)
  call PetscViewerDestroy(viewer,ierr)

  !.....Create Linear Solver Context
  call KSPCreate(PETSC_COMM_WORLD,ksp,ierr)

  !.....Set operators. Here the matrix that defines the linear system also
  !     serves as the preconditioning matrix.
  !call KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN,ierr) !aha commented and replaced by next line
  call KSPSetOperators(ksp,A,A,ierr) ! remember: here A = A-sigma*B

  !.....Set Relative and Absolute Tolerances and Use Default for Divergence Tol
  tol = 1.e-10
  call KSPSetTolerances(ksp,tol,tol,PETSC_DEFAULT_REAL,PETSC_DEFAULT_INTEGER,ierr)

  !.....Set the Direct (LU) Solver
  call KSPSetType(ksp,KSPPREONLY,ierr)
  call KSPGetPC(ksp,pc,ierr)
  call PCSetType(pc,PCLU,ierr)
  call PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST,ierr) ! MATSOLVERSUPERLU_DIST MATSOLVERMUMPS

  !.....Create Right-Hand-Side Vector
  call MatCreateVecs(A,frhs,PETSC_NULL_OBJECT,ierr)
  call MatCreateVecs(A,sol,PETSC_NULL_OBJECT,ierr)

  allocate(xwork1(IendA-IstartA))
  allocate(loc(IendA-IstartA))

  ct=0
  do i=IstartA,IendA-1
     ct=ct+1
     loc(ct)=i
     xwork1(ct)=(1.0d0,0.0d0)
  enddo

  call VecSetValues(frhs,IendA-IstartA,loc,xwork1,INSERT_VALUES,ierr)
  call VecZeroEntries(sol,ierr)

  deallocate(xwork1,loc)

  !.....Assemble Vectors
  call VecAssemblyBegin(frhs,ierr)
  call VecAssemblyEnd(frhs,ierr)

  !.....Solve the Linear System
  call KSPSolve(ksp,frhs,sol,ierr)

  !call VecView(sol,PETSC_VIEWER_STDOUT_WORLD,ierr)

  if (rank==0) then
     call cpu_time(endd)
     write(*,*)
     print '("Total time for HowBigLUCanBe = ",f21.3," seconds.")',endd-begin
  endif

  call SlepcFinalize(ierr)

  STOP

end Subroutine HowBigLUCanBe
On 07/08/2015 11:23 AM, Xiaoye S. Li wrote:
Indeed, the parallel symbolic factorization routine needs a power-of-2
number of processes; however, you can use however many processes you
need. Internally, we redistribute the matrix to the nearest power of 2
processes, do the symbolic factorization, then redistribute back to all
the processes to do the factorization, triangular solve, etc. So there is
no restriction from the user's viewpoint.
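(For illustration only, this is not SuperLU_DIST source code, just a
sketch of the rounding I mean, assuming "nearest" here is the largest
power of two that does not exceed the process count:)

      integer function symb_nprocs(nprocs)
      ! illustration: largest power of two not exceeding nprocs
      implicit none
      integer, intent(in) :: nprocs
      symb_nprocs = 1
      do while (2*symb_nprocs <= nprocs)
         symb_nprocs = 2*symb_nprocs
      end do
      end function symb_nprocs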
It's difficult to tell what the problem is. Do you think you can print
your matrix? Then I can do some debugging by running superlu_dist
standalone.
Sherry
On Wed, Jul 8, 2015 at 10:34 AM, Anthony Paul Haas <
[email protected]> wrote:
Hi,
I have used the switch -mat_superlu_dist_parsymbfact in my PBS script.
However, although my program worked fine with sequential symbolic
factorization, I get one of the following two behaviors when I run with
parallel symbolic factorization (depending on the number of processors
that I use):
1) the program just hangs (it seems stuck in some subroutine ==> see
test.out-hangs)
2) I get a floating point exception ==> see
test.out-floating-point-exception
Note that, as suggested in the SuperLU manual, I use a power-of-2 number
of procs. Are there any tunable parameters for the parallel symbolic
factorization? Also note that when I build my sparse matrix, most of the
elements I add are nonzero of course, but to simplify the programming I
also add a few explicit zero elements in the sparse matrix. I was
thinking that maybe, if the parallel symbolic factorization proceeds by
blocks, there could be some blocks where the pivot would be zero, hence
creating the FPE?
Thanks,
Anthony
On Wed, Jul 8, 2015 at 6:46 AM, Xiaoye S. Li <[email protected]>
wrote:
Did you find out how to change the option to use parallel symbolic
factorization? Perhaps the PETSc team can help.
Sherry
On Tue, Jul 7, 2015 at 3:58 PM, Xiaoye S. Li <[email protected]>
wrote:
Is there an inquiry function that tells you all the available
options?
Sherry
On Tue, Jul 7, 2015 at 3:25 PM, Anthony Paul Haas <
[email protected]> wrote:
Hi Sherry,
Thanks for your message. I have used the superlu_dist default options. I
did not realize that I was doing serial symbolic factorization; that is
probably the cause of my problem.
Each node on Garnet has 60 GB of usable memory, and I can run with 1, 2,
4, 8, 16 or 32 cores per node.
So I should use:
-mat_superlu_dist_r 20
-mat_superlu_dist_c 32
How do you specify the parallel symbolic factorization option?
Is it -mat_superlu_dist_matinput 1?
Thanks,
Anthony
On Tue, Jul 7, 2015 at 3:08 PM, Xiaoye S. Li <[email protected]>
wrote:
The superlu_dist failure occurs during symbolic factorization. Since you
are using serial symbolic factorization, it requires the entire graph of
A to be available in the memory of one MPI task. How much memory do you
have for each MPI task?
It won't help even if you use more processes. You should try the
parallel symbolic factorization option.
Another point: you set up the process grid as:
Process grid nprow 32 x npcol 20
For better performance, you should swap the grid dimensions. That is,
it's better to use 20 x 32; never make nprow larger than npcol.
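With the PETSc interface, the grid can be set at run time; e.g., for 640
MPI tasks (a sketch using PETSc's superlu_dist options):
  -mat_superlu_dist_r 20 -mat_superlu_dist_c 32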
Sherry
On Tue, Jul 7, 2015 at 1:27 PM, Barry Smith <
[email protected]>
wrote:
I would suggest running a sequence of problems, 101 by 101, 111 by 111,
etc., and getting the memory usage in each case (when you run out of
memory you get NO useful information about memory needs). You can then
plot memory usage as a function of problem size to get a handle on how
much memory it is using. You can also run on more and more processes
(which have more total memory) to see how large a problem you may be able
to reach.
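For example, something like the following (just a sketch: the executable
name and the grid-size option are placeholders for whatever your code
actually takes, and the grep pattern matches the memory line your code
already prints):

for n in 101 111 121 131 141 151; do
  mpiexec -n 4 ./mycode -grid_size $n | grep "memory:" > mem_${n}.log
done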
MUMPS also has an "out of core" version (which we have never used) that
could in theory let you get to larger problems if you have lots of disk
space, but you are on your own figuring out how to use it.
Barry
On Jul 7, 2015, at 2:37 PM, Anthony Paul Haas <
[email protected]> wrote:
Hi Jose,
In my code, I first use PETSc to solve a linear system to get the
baseflow (without using SLEPc), and then I use SLEPc to do the stability
analysis of that baseflow. This is why there are some SLEPc options that
are not used in test.out-superlu_dist-151x151 (when I am solving for the
baseflow with PETSc only). I have attached a 101x101 case for which I get
the eigenvalues; that case works fine. However, if I increase to 151x151,
I get the error that you can see in test.out-superlu_dist-151x151
(similar error with MUMPS: see test.out-mumps-151x151, line 2918). If you
look at the very end of the files test.out-superlu_dist-151x151 and
test.out-mumps-151x151, you will see that the last info message printed
is:
On Processor (after EPSSetFromOptions) 0 memory: 0.65073152000E+08   =====> (see line 807 of module_petsc.F90)
This means that the memory error probably occurs in the call to EPSSolve
(see module_petsc.F90 line 810). I would like to evaluate how much memory
is required by the most memory-intensive operation within EPSSolve. Since
I am solving a generalized EVP, I would imagine that it would be the LU
decomposition. But is there an accurate way of doing it?
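(A minimal sketch of one way to bracket that call, assuming PETSc's
PetscMemoryGetCurrentUsage, which reports the current resident memory of
the calling process; 'eps' stands for whatever the EPS object is called
in module_petsc.F90:)

      PetscLogDouble mem_before, mem_after
      call PetscMemoryGetCurrentUsage(mem_before,ierr)
      call EPSSolve(eps,ierr)                           ! module_petsc.F90 line 810
      call PetscMemoryGetCurrentUsage(mem_after,ierr)
      if (rank==0) write(*,*) 'memory around EPSSolve (bytes):', mem_before, mem_after
      ! note: this is current usage, not the high-water mark; PetscMemoryGetMaximumUsage
      ! (enabled with PetscMemorySetGetMaximumUsage early in the run) may be closer to a peak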
Before moving to iterative solvers, I would like to exploit direct
solvers as much as I can. I tried GMRES with the default preconditioner
at some point, but I had convergence problems. What solver/preconditioner
would you recommend for a generalized non-Hermitian (EPS_GNHEP) EVP?
Thanks,
Anthony
On Tue, Jul 7, 2015 at 12:17 AM, Jose E. Roman <
[email protected]> wrote:
On 07/07/2015, at 02:33, Anthony Haas wrote:
Hi,
I am computing eigenvalues using PETSc/SLEPc and superlu_dist for the LU
decomposition (my problem is a generalized eigenvalue problem). The code
runs fine for a 101x101 grid, but when I increase to 151x151, I get the
following error:
Can't expand MemType 1: jcol 16104
(and then [NID 00037] 2015-07-06 19:19:17 Apid 31025976: OOM killer
terminated this process.)
It seems to be a memory problem. I monitor the memory usage as far as I
can, and it seems that the memory usage is pretty low. The most
memory-intensive part of the program is probably the LU decomposition in
the context of the generalized EVP. Is there a way to evaluate how much
memory will be required for that step? I am currently running the debug
version of the code, which I would assume uses more memory?
I have attached the output of the job. Note that the program uses PETSc
twice: 1) to solve a linear system, for which no problem occurs, and
2) to solve the generalized EVP with SLEPc, where I get the error.
Thanks
Anthony
<test.out-superlu_dist-151x151>
In the output you attached there are no SLEPc objects in the report, and
the SLEPc options are not used. It seems that the SLEPc calls are
skipped?
Do you get the same error with MUMPS? Have you tried to solve the linear
systems with a preconditioned iterative solver?
Jose
<module_petsc.F90><test.out-mumps-151x151><test.out_superlu_dist-101x101><test.out-superlu_dist-151x151>