Hi Mike,

I can't offer a definite solution, but my gut feeling on something like
this is an inconsistency between the compiled binary and its
dependencies.  Was the mpiblast binary compiled on the master node and
then copied (or made available through NFS) to the worker nodes?  Do you
have the same version of gcc, glibc, and MPI, all compiled with the same
version of gcc, on all the nodes?  Your Xeon processors support
Hyper-Threading, but the Pentium IIIs do not.
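
For example, a quick consistency check along these lines (the node names
below are just placeholders for your workers) should tell you whether gcc
and glibc match everywhere:

    # hypothetical sketch - substitute your actual worker node names
    for n in node01 node02 node03; do
        echo "== $n =="
        ssh $n 'gcc --version | head -1; rpm -q glibc'
    done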

If you run your job from the master node and use the '--nolocal' flag on
the mpirun command, do you still get the error?
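
Something like this, reusing the command line from your original mail (I
am not certain whether your mpirun spells it '-nolocal' or '--nolocal',
so check 'mpirun -h' first; with the head node excluded you may also want
to drop -np to 32 so the 16 workers aren't oversubscribed):

    mpirun --nolocal -np 34 /usr/local/bin/mpiblast \
        --debug=/database/tmpfiledir/debug -p blastx -d uniprot \
        -i /database/tmpfiledir/contig76_53738-84646.tmp \
        -o /database/results/contigs_masked.out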

If you did compile on your head node and then copied (or exported through
NFS) the mpiblast binary to your client nodes, it might be worth
recompiling mpiblast (and all dependencies, if needed) on one of your
client nodes and then copying (or exporting through NFS) that new build
to all your worker nodes.  Then rerun your blast job and make sure the
head node isn't used as a worker node.
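
Roughly along these lines (the node name is just a placeholder, and the
configure options are the ones from your original mail):

    # log in to one of the Pentium III workers and rebuild there
    ssh node01
    cd /path/to/mpiblast-source
    ./configure --with-ncbi=/usr/local/ncbi --with-mpi=/opt/lam-7.0.6
    make && make install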

If your client nodes are all identical (architecture, software,
libraries, etc.), I would expect a successful run of mpiblast.  By
compiling all your binaries and libraries on a client node and
restricting your job to just the client nodes, you'll know whether the
problem is a consistency issue between your compiled binary and the two
platforms.

Best of Luck,
Stephen


On Mon, 19 Dec 2005, Aaron Darling wrote:

> Hi Mike,
>
> What compiler are you using?  What ./configure command-line options did
> you use?  Did you run ./configure with --enable-MPI_Alloc_mem?  I
> personally develop and test on SuSE, Rocks 3.3 (which is a RHEL
> derivative), Windows, and occasionally OS X.
> To further track this problem down it may be necessary to compile with
> debug options and run the program with a debugger attached.  If you can
> send me the query data set (off the list) I can try to reproduce the
> problem.
>
> -Aaron
>
>
> Mike Schilling wrote:
>
> > Unfortunately I already sent this mail prematurely - and incomplete -
> > sorry for posting it again ...
> >
> > -------------
> >
> > Hello everybody,
> >
> > ... I am trying to run mpiblast on a 17-node (2 processors each) OSCAR
> > 4.2 cluster based on RHEL 4. The master node is a dual Xeon 2.4 GHz
> > while the 16 workers are dual Pentium III (933 MHz) machines.
> >
> > All cluster tests described in the OSCAR installation were successful,
> > and I was also able to compile the NCBI toolbox (patch included)
> > without problems. Mpiblast was compiled using the following options:
> >
> > --with-ncbi=/usr/local/ncbi and --with-mpi=/opt/lam-7.0.6
> >
> > This was successful as well. Blasting very small contigs (8 kb) against
> > the uniprot database works perfectly and fast. When I try queries over
> > 30 kb in size, the following error occurs after roughly 5 minutes:
> >
> > mpirun -np 34 /usr/local/bin/mpiblast
> > --debug=/database/tmpfiledir/debug -p blastx -d uniprot -i
> > /database/tmpfiledir/contig76_53738-84646.tmp -o
> > /database/results/contigs_masked.out
> > *** glibc detected *** free(): invalid pointer: 0x08a34870 ***
> > -----------------------------------------------------------------------------
> >
> > One of the processes started by mpirun has exited with a nonzero exit
> > code.  This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 20219 failed on node n7 (10.0.0.8) due to signal 6.
> > -----------------------------------------------------------------------------
> >
> > 2       134.052 Bailing out with signal -1
> > 3       134.054 Bailing out with signal -1
> > 4       134.055 Bailing out with signal -1
> > 5       134.057 Bailing out with signal -1
> > 6       134.059 Bailing out with signal -1
> > 7       134.06  Bailing out with signal -1
> > 8       134.063 Bailing out with signal -1
> > 9       134.063 Bailing out with signal -1
> > 10      134.066 Bailing out with signal -1
> > 11      134.066 Bailing out with signal -1
> > 12      134.069 Bailing out with signal -1
> > 15      134.071 Bailing out with signal -1
> > 13      134.071 Bailing out with signal -1
> > 17      134.076 Bailing out with signal -1
> > 16      134.076 Bailing out with signal -1
> > 18      134.077 Bailing out with signal -1
> > 19      134.078 Bailing out with signal -1
> > 20      134.08  Bailing out with signal -1
> > 21      134.081 Bailing out with signal -1
> > 22      134.084 Bailing out with signal -1
> > 23      134.084 Bailing out with signal -1
> > 24      134.086 Bailing out with signal -1
> > 25      134.088 Bailing out with signal -1
> > 27      134.09  Bailing out with signal -1
> > 26      134.091 Bailing out with signal -1
> > 28      134.093 Bailing out with signal -1
> > 29      134.094 Bailing out with signal -1
> > 30      134.096 Bailing out with signal -1
> > 32      134.099 Bailing out with signal -1
> > 34      134.1   Bailing out with signal -1
> > 33      134.101 Bailing out with signal -1
> > 31      134.097 Bailing out with signal -1
> > 35      134.1   Bailing out with signal -1
> >
> >
> > ... there is no further error message when I switch on the debug logs -
> > it seems to break suddenly, and the address in the "glibc free()"
> > message sometimes shows different numbers on different nodes ...
> >
> > I tried the "-ssi rpi lamd" option of mpirun as well, since there was
> > something about lamd in the manual - same result ...
> >
> > Next, I compiled both the 2004 and the 2005 versions of the NCBI
> > toolbox, with a similar result. Compiling against mpich or lam also
> > makes no difference.  I tried to compile the 1.3.0 release of mpiblast
> > as well, but without success.
> >
> > Is there anything I can do to get more debug output? Do you have a
> > recommendation for a kernel or a specific RHEL version (or maybe a
> > specific cluster software stack) where it is known to run? I do not
> > believe the problem is hardware-related.
> >
> > Any help would be appreciated.
> >
> > best regards
> >
> >
> > Mike
> >
>
>
>

