Greetings!  I'm forwarding this previously submitted bug report to
the beowulf lists and the lam users list to look for interested users
who could either confirm, deny, or help resolve this bug.


--[[message/rfc822]]
From: Camm Maguire <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED],[EMAIL PROTECTED]
cc: [EMAIL PROTECTED]
Subject: Bug in lam-6.2pl3/blacs1.1/scalapack1.6 combo
Mime-Version: 1.0 (generated by tm-edit 7.106)
Content-Type: text/plain; charset=US-ASCII
Message-Id: <[EMAIL PROTECTED]>
Date: Fri, 17 Sep 1999 23:07:52 -0400



Greetings!  I've found a quite reproducible bug in the above software
combination.  The command

 mpirun -np 16 -O N xdinv

consistently fails with N=2048, nb=16, nr=nc=4 somewhere in the routine
pdgetri, specifically in the loop from lines 285 to 306.  Running with
the -lamd option to mpirun clears the problem, which seems to implicate
LAM in the failure.  The MPI routines report the following error:

MPI_Recv: process in remote group is dead (rank 0, comm 3)

where the rank and comm numbers vary with no discernible pattern.  I'm
running Linux 2.2.12 on a 16-node PII-350 Beowulf over 100Mbit
switched fast ethernet.  There are no errors reported in the kernel
logs.  LAM was configured with

        ./configure --prefix=`pwd`/debian/tmp/usr/lib/lam \
                    --with-final-home=/usr/lib/lam \
                    --with-rpi=usysv \
                    --with-shared \
                    --with-cc=$(CC)

and built with 

intech19:/fix/c/home/camm/scalapack-1.6# egcc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/egcs-2.91.60/specs
gcc version egcs-2.91.60 Debian 2.1 (egcs-1.1.1 release)
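
For anyone trying to reproduce this, it may help to see how much data
each process holds in the failing configuration (N=2048, nb=16, 4x4
grid).  The following is only a rough sketch: it re-implements in
Python the arithmetic of ScaLAPACK's NUMROC tool routine, and it
assumes a source process of 0 in both grid dimensions, which the
report does not state.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) of an n-long dimension owned by process iproc
    under a block-cyclic distribution with block size nb."""
    # Distance of this process from the one holding the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                       # number of whole blocks
    count = (nblocks // nprocs) * nb        # blocks every process gets
    extrablks = nblocks % nprocs            # leftover whole blocks
    if mydist < extrablks:
        count += nb                         # one extra whole block
    elif mydist == extrablks:
        count += n % nb                     # the trailing partial block
    return count


if __name__ == "__main__":
    # Values from the report; source process 0 is an assumption.
    N, NB, NPROW, NPCOL = 2048, 16, 4, 4
    for prow in range(NPROW):
        rows = numroc(N, NB, prow, 0, NPROW)
        cols = numroc(N, NB, 0, 0, NPCOL)
        print("process row %d: local matrix %d x %d" % (prow, rows, cols))
```

Under these assumptions every process ends up with a 512 x 512 local
matrix, i.e. 32 x 32 blocks of size 16.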

I've noticed that the block size that fails most often is 16 in double
precision, which corresponds to a 2k message, the same length as in
the LAM/Linux performance problem reported on the web site.  Of
course, here we don't just see poor performance, but outright
failure.  I'll be trying lam 6.2 pl4 soon.  Please advise if I can
supply any further information regarding this bug.

PS.  Since writing this, I've tried lam-6.2b-pl4, and found the same
situation.  The problem appears for block sizes in the 16-28 range;
outside that range all is stable.  Blacs is patched with the latest
mpi patch.
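
The relation between block size and message length mentioned above
can be checked with a few lines of Python.  This is a sketch that
assumes each message carries one full nb x nb block of 8-byte doubles;
the actual messages sent inside pdgetri may be shaped differently, and
the function name is made up for illustration.

```python
DOUBLE_BYTES = 8  # size of one double-precision element


def block_message_bytes(nb, elem_bytes=DOUBLE_BYTES):
    """Bytes in one nb x nb block (hypothetical helper, assumes the
    whole square block travels as a single message)."""
    return nb * nb * elem_bytes


if __name__ == "__main__":
    # End points of the failing block-size range from the report.
    for nb in (16, 28):
        print("nb=%d -> %d bytes" % (nb, block_message_bytes(nb)))
```

With nb=16 this gives 16 * 16 * 8 = 2048 bytes, matching the 2k
figure; under the same assumption, nb=28 gives 6272 bytes.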


Take care,

Camm Maguire                                            [EMAIL PROTECTED]
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah

