Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Mon, 25 Sep 2006 10:50 -0500:
Can anyone make any sense of this?
I have a feeling these are related to the hangups I'm having w/o the client interface in openib. This is built off of the latest CVS head: 6 server nodes, 1 client node, mounted via pvfs2-client over openib.

They are exactly related.

Log message from the client:

[E 10:19:33.127182] fp_multiqueue_cancel: flow proto cancel called on 0x10151cf0
[E 10:19:33.127283] handle_io_error: flow proto error cleanup started on 0x10151cf0, error_code: -1610612737

The client is bored of waiting for one of its IO flows to finish.  A
read or write operation.  The error code translates to "Operation
cancelled (possibly due to timeout)", indicating the client itself
did BMI_Cancel() after 30 sec of waiting for a response.  Things are
designed to recover after this, but they may not as that's not well
debugged and "should not happen" in normal operation.  Even if it
did recover properly, your performance would be terrible.
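
For anyone who wants to check the number in that log line by hand, here is a minimal sketch of the arithmetic. The bit positions are an assumption recalled from PVFS2's pvfs2-types.h (an error flag at bit 30 and a "non-errno" flag at bit 29), and decode_pvfs_error is a hypothetical helper, not part of PVFS2. Verify both against your own tree before relying on it.

```python
# Assumed bit layout, recalled from pvfs2-types.h; check your source tree.
PVFS_ERROR_BIT = 1 << 30            # assumed: marks the value as a PVFS error
PVFS_NON_ERRNO_ERROR_BIT = 1 << 29  # assumed: code has no errno equivalent


def decode_pvfs_error(code):
    """Split a (usually negative) PVFS-style error code into flags and a number.

    Hypothetical helper for illustration only.
    """
    magnitude = -code if code < 0 else code
    flags = PVFS_ERROR_BIT | PVFS_NON_ERRNO_ERROR_BIT
    return {
        "hex": hex(magnitude),
        "error_bit": bool(magnitude & PVFS_ERROR_BIT),
        "non_errno_bit": bool(magnitude & PVFS_NON_ERRNO_ERROR_BIT),
        "number": magnitude & ~flags,  # low-order code once flags are masked off
    }


decoded = decode_pvfs_error(-1610612737)
print(decoded)
```

Under those assumptions the value is 0x60000001 with both flag bits set and a low-order number of 1, which would be consistent with the "Operation cancelled" reading above if that number is the cancel code.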

Did you take a look at debugging your netpipe failure testcase?
That seems like the lowest level where we can figure out what is
going wrong.  You should not be getting timeouts at all, and
appearances point to messages getting lost in the network somehow.
I cannot get your testcase to fail here, after over 72 hours of
continuous testing.
I wish that were the case on our end. Hopefully the utility you sent me will point out some points of failure on the network, and I can resolve those. (Pete) Also, if you're interested in testing on our end, I'm sending an offline mail to you in a few minutes with instructions.
Also, did you have a chance to run the network debugging tool I sent
you offline?  Both these last mails from me should have reached you
on Monday last week.

I'm running the network debugging tool right now, though I'm not sure it's running correctly; maybe I need to look back at that email to see when it will complete. I tried the '8 minute' test you suggested and we're at 25 minutes now.
You really can't expect to get the full system working until you fix
the basic failure.
I thought I'd take a blind shot and try it out, in hopes that the error messages I get here would provide some insight into what's breaking underneath, since those messages aren't helping out right now.
                -- Pete





--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
