Thanks Robin, I'll give these ideas a try and ask on an mpich list.
Cheers
Ben
----- Original Message ----- 
From: "Robin Humble" <[EMAIL PROTECTED]>
To: <oscar-users@lists.sourceforge.net>
Sent: Saturday, January 06, 2007 1:03 PM
Subject: Re: [Oscar-users] p4_errors net-recv wakeup_slave etc


> On Sat, Jan 06, 2007 at 11:51:44AM +1000, Ben Turner - Dayboro Geophysical wrote:
>>I have oscar-4-2 installed on my ibm eserver cluster. I am trying to run a
>>parallel program on five nodes. The process runs for a while successfully
>>but then comes up with the following set of errors and fails. I have
>>searched the archives but can't seem to find any answers. Does anybody
>>have an idea?
>
> best guess is that it's a bug in your MPI code.
>
> the messages below are errors from mpich which look fairly generic and
> don't really convey much information, at least to me. usually they just
> mean one of your processes died 'cos the code ran into a problem, but it
> could be many things.
>
> suggestions:
> you could try compiling your code against LAM instead of mpich which
> might produce different errors that make more sense to you.
> you could try asking on an mpich mailing list.
> run each process of your code inside a debugger so you can see where it
> crashes.
> there's an outside chance that it's a networking problem with the
> cluster, but if the code runs for a while before failing this seems
> unlikely.
> there are a bunch of other things it might be, but the above seem the
> most likely.
>
> unfortunately none of the above is really OSCAR related in any way so
> it probably isn't the right place to ask your questions...
>
> cheers,
> robin
>
>>
>>p4_1171: (2106.532602) net_recv failed for fd = 3
>>p4_1171: p4_error: net_recv read, errno = : 104
>>rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32
>>p2_1234: (2111.528337) net_recv failed for fd = 3
>>p2_1234: p4_error: net_recv read, errno = : 104
>>rm_l_2_1252: (2111.528627) net_send: could not write to fd=5, errno = 32
>>p3_1166: p4_error: net_recv read: probable EOF on socket: 1
>>rm_l_3_1184: (2109.069751) net_send: could not write to fd=5, errno = 32
>>p1_3576: (2114.044709) net_recv failed for fd = 3
>>p1_3576: p4_error: net_recv read, errno = : 104
>>rm_l_1_3594: (2114.044992) net_send: could not write to fd=5, errno = 32
>>bm_list_7639: (2116.924922) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925033) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925086) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925135) wakeup_slave: unable to interrupt slave 0 pid 7638
>>bm_list_7639: (2116.925181) wakeup_slave: unable to interrupt slave 0 pid 7638
>>p5_1098: p4_error: net_recv read: probable EOF on socket: 1
>>rm_l_5_1116: (2104.011456) net_send: could not write to fd=5, errno = 32
>>
>>Cheers
>>Ben
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Oscar-users mailing list
> Oscar-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oscar-users
>
>
> 
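
A note on the errno values in the quoted log (104 and 32): they can be looked up by number. The following is a minimal sketch using Python's standard errno and os modules, assuming the cluster nodes run Linux, since errno numbering is platform-specific. The helper name describe is made up for illustration and is not part of mpich or OSCAR.

```python
import errno
import os


def describe(code):
    """Return (symbolic name, human-readable message) for an errno number.

    The mapping is the one used by the machine running this script, so run
    it on the same kind of system that produced the log (Linux is assumed).
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# errno numbers copied from the net_recv / net_send lines in the log
for code in (104, 32):
    name, message = describe(code)
    print(f"errno {code}: {name} ({message})")
```

On Linux, 104 is ECONNRESET and 32 is EPIPE. Both say that the other end of the socket went away, which fits the guess that one process died first and the others failed while still talking to it.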


