Hi Mike,

What compiler are you using? What ./configure command-line options did you use? Did you run ./configure with --enable-MPI_Alloc_mem? I personally develop and test on SuSE, Rocks 3.3 (which is a RHEL derivative), Windows, and occasionally OS X. To track this problem down further, it may be necessary to compile with debug options and run the program with a debugger attached. If you can send me the query data set (off the list) I can try to reproduce the problem.
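For example, a rough sketch assuming a GNU toolchain (the exact flags may need adjusting for your environment):

    # rebuild with debug symbols and without optimization
    CFLAGS="-g -O0" CXXFLAGS="-g -O0" ./configure --with-ncbi=/usr/local/ncbi --with-mpi=/opt/lam-7.0.6
    make clean && make

    # allow core dumps, re-run the failing job, then inspect the core with gdb
    ulimit -c unlimited
    gdb /usr/local/bin/mpiblast core    # then type 'bt' for a backtrace

That would at least tell us which free() call glibc is complaining about.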

-Aaron


Mike Schilling wrote:

Unfortunately I already sent a partial - incomplete - version of this mail; sorry for posting again ...

-------------

Hello everybody,

... I am trying to run mpiblast on a 17-node (2 processors each) OSCAR 4.2 cluster based on RHEL 4. The master node is a dual Xeon 2.4 GHz, while the 16 workers are dual Pentium III (933 MHz) machines.

All cluster tests described in the OSCAR installation were successful, and I was also able to compile the NCBI toolbox (patch included) without problems. mpiblast was compiled using the following options:

--with-ncbi=/usr/local/ncbi and --with-mpi=/opt/lam-7.0.6
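i.e., the configure step was essentially:

    ./configure --with-ncbi=/usr/local/ncbi --with-mpi=/opt/lam-7.0.6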

This was successful as well. Blasting very small contigs (8 kb) against the uniprot database works perfectly and fast. But when I try queries over 30 kb in size, the following error occurs after roughly 5 minutes:

mpirun -np 34 /usr/local/bin/mpiblast --debug=/database/tmpfiledir/debug -p blastx -d uniprot -i /database/tmpfiledir/contig76_53738-84646.tmp -o /database/results/contigs_masked.out
*** glibc detected *** free(): invalid pointer: 0x08a34870 ***
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 20219 failed on node n7 (10.0.0.8) due to signal 6.
-----------------------------------------------------------------------------
2       134.052 Bailing out with signal -1
3       134.054 Bailing out with signal -1
4       134.055 Bailing out with signal -1
5       134.057 Bailing out with signal -1
6       134.059 Bailing out with signal -1
7       134.06  Bailing out with signal -1
8       134.063 Bailing out with signal -1
9       134.063 Bailing out with signal -1
10      134.066 Bailing out with signal -1
11      134.066 Bailing out with signal -1
12      134.069 Bailing out with signal -1
15      134.071 Bailing out with signal -1
13      134.071 Bailing out with signal -1
17      134.076 Bailing out with signal -1
16      134.076 Bailing out with signal -1
18      134.077 Bailing out with signal -1
19      134.078 Bailing out with signal -1
20      134.08  Bailing out with signal -1
21      134.081 Bailing out with signal -1
22      134.084 Bailing out with signal -1
23      134.084 Bailing out with signal -1
24      134.086 Bailing out with signal -1
25      134.088 Bailing out with signal -1
27      134.09  Bailing out with signal -1
26      134.091 Bailing out with signal -1
28      134.093 Bailing out with signal -1
29      134.094 Bailing out with signal -1
30      134.096 Bailing out with signal -1
32      134.099 Bailing out with signal -1
34      134.1   Bailing out with signal -1
33      134.101 Bailing out with signal -1
31      134.097 Bailing out with signal -1
35      134.1   Bailing out with signal -1


... there is no further error message when I switch on the debug logs - it seems to crash suddenly, and the address in the "glibc free()" message differs between nodes and between runs ...

I also tried the "-ssi rpi lamd" option of mpirun, since the manual mentions something in connection with lamd - same result ...
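For reference, with that option the invocation becomes:

    mpirun -ssi rpi lamd -np 34 /usr/local/bin/mpiblast --debug=/database/tmpfiledir/debug -p blastx -d uniprot -i /database/tmpfiledir/contig76_53738-84646.tmp -o /database/results/contigs_masked.out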

Next, I compiled both the 2004 and the 2005 versions of the NCBI toolbox, with similar results. Compiling against MPICH instead of LAM also makes no difference. I tried to compile the 1.3.0 release of mpiblast as well, but without success.

Is there anything I can do to get more debug output? Do you have a recommendation for a kernel or a specific RHEL version (or maybe a specific cluster software) where it is known to run? I do not believe the problem is hardware-related.
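For example, would glibc's malloc debugging be of any use here? I imagine something like the following (assuming the variable reaches the remote nodes - LAM's mpirun has a -x option for exporting environment variables, if I am not mistaken):

    export MALLOC_CHECK_=2    # glibc aborts as soon as it detects heap corruption
    mpirun -np 34 /usr/local/bin/mpiblast --debug=/database/tmpfiledir/debug -p blastx -d uniprot -i /database/tmpfiledir/contig76_53738-84646.tmp -o /database/results/contigs_masked.out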

Any help would be appreciated.

Best regards,


Mike



