Thanks Robin,

I'll give these ideas a try and ask on an mpich list.

Cheers
Ben

----- Original Message -----
From: "Robin Humble" <[EMAIL PROTECTED]>
To: <oscar-users@lists.sourceforge.net>
Sent: Saturday, January 06, 2007 1:03 PM
Subject: Re: [Oscar-users] p4_errors net-recv wakeup_slave etc

> On Sat, Jan 06, 2007 at 11:51:44AM +1000, Ben Turner - Dayboro Geophysical wrote:
>>I have oscar-4-2 installed on my ibm eserver cluster. I am trying to run a
>>parallel program on five nodes. The process runs for a while successfully
>>but then comes up with the following set of errors and fails. I have
>>searched the archives but can't seem to find any answers. Does anybody
>>have an idea?
>
> best guess is that it's a bug in your MPI code.
>
> the below messages are errors from mpich which look fairly generic and
> don't really convey much information, at least to me. usually they just
> mean one of your threads died 'cos the code ran into a problem, but it
> could be many things.
>
> suggestions:
>  - you could try compiling your code against LAM instead of mpich which
>    might produce different errors that make more sense to you.
>  - you could try asking on an mpich mailing list.
>  - run each thread of your code inside a debugger so you can see where it
>    crashes.
>  - there's an outside chance that it's a networking problem with the
>    cluster, but if the code runs for a while before failing this seems
>    unlikely.
>  - there are a bunch of other things it might be, but the above seem the
>    most likely.
>
> unfortunately none of the above is really OSCAR related in any way so
> it probably isn't the right place to ask your questions...
>
> cheers,
> robin
>
>>
>>p4_1171: (2106.532602) net_recv failed for fd = 3
>>p4_1171: p4_error: net_recv read, errno = : 104
>>rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32
>>p2_1234: (2111.528337) net_recv failed for fd = 3
>>p2_1234: p4_error: net_recv read, errno = : 104
>>rm_l_2_1252: (2111.528627) net_send: could not write to fd=5, errno = 32
>>p3_1166: p4_error: net_recv read: probable EOF on socket: 1
>>rm_l_3_1184: (2109.069751) net_send: could not write to fd=5, errno = 32
>>p1_3576: (2114.044709) net_recv failed for fd = 3
>>p1_3576: p4_error: net_recv read, errno = : 104
>>rm_l_1_3594: (2114.044992) net_send: could not write to fd=5, errno = 32
>>bm_list_7639: (2116.924922) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925033) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925086) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925135) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925181) wakeup_slave: unable to interrupt slave 0 pid 7638
>>p5_1098: p4_error: net_recv read: probable EOF on socket: 1
>>rm_l_5_1116: (2104.011456) net_send: could not write to fd=5, errno = 32
>>
>>Cheers
>>Ben
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Oscar-users mailing list
> Oscar-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oscar-users
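
For anyone reading this thread later: the two errno values in the quoted log can at least be decoded. The sketch below assumes the nodes run Linux, where errno 104 is ECONNRESET and errno 32 is EPIPE (the numbers differ on other systems). The `LINUX_ERRNO` table and the `explain` helper are made up for illustration and are not part of mpich or OSCAR.

```python
import re

# Assumption: Linux errno numbering. Only the two values seen in the log
# are listed; this is not a complete table.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches both "errno = : 104" and "errno = 32" as they appear in the log.
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def explain(line):
    """Return (code, name, meaning) for a log line, or None if it has no errno."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, meaning = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this small table"))
    return code, name, meaning


sample = [
    "p4_1171: p4_error: net_recv read, errno = : 104",
    "rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32",
    "p3_1166: p4_error: net_recv read: probable EOF on socket: 1",
]

for line in sample:
    print(explain(line))
```

Both codes mean the other end of a socket went away, which fits Robin's guess that one process died first and the rest failed as a consequence. They do not say why that process died, so the debugger suggestion still applies.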