[EMAIL PROTECTED] wrote on Mon, 25 Sep 2006 10:50 -0500:
> Can anyone make any sense of this?
> I have a feeling these are related to the hangups I'm having w/o the
> client interface in openib.
> This is built off of latest cvs head. 6 server nodes, 1 client node.
> mounted via pvfs2-client over openib.
Yes, they are directly related.
> Log message from the client:
>
> [E 10:19:33.127182] fp_multiqueue_cancel: flow proto cancel called on
> 0x10151cf0
> [E 10:19:33.127283] handle_io_error: flow proto error cleanup started on
> 0x10151cf0, error_code: -1610612737
The client got tired of waiting for one of its IO flows (a read or
write operation) to finish. The error code translates to "Operation
cancelled (possibly due to timeout)", indicating that the client itself
called BMI_Cancel() after 30 sec of waiting for a response. Things are
designed to recover after this, but they may not, as that path is not
well debugged and "should not happen" in normal operation. Even if it
did recover properly, your performance would be terrible.
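If you want to pick apart codes like this one yourself, here is a rough
sketch of the arithmetic. The value in your log is -(2^30 + 2^29 + 1).
The names ERROR_BIT and NON_ERRNO_ERROR_BIT, and the reading of the low
bits as an index, are assumptions inferred from that layout, not copied
from the PVFS2 headers, so treat this as illustration only.

```python
# Assumed bit layout, inferred from the logged value; not taken from PVFS2 headers.
ERROR_BIT = 1 << 30            # assumed: marks the value as an error code
NON_ERRNO_ERROR_BIT = 1 << 29  # assumed: marks an error with no errno equivalent


def decode_error_code(code):
    """Split a logged error code into its assumed flag bits and low index."""
    magnitude = -code if code < 0 else code  # codes are logged as negatives
    return {
        "error_bit": bool(magnitude & ERROR_BIT),
        "non_errno_bit": bool(magnitude & NON_ERRNO_ERROR_BIT),
        "index": magnitude & ~(ERROR_BIT | NON_ERRNO_ERROR_BIT),
    }


if __name__ == "__main__":
    # The code from the client log above.
    print(decode_error_code(-1610612737))
```

Under that assumed layout both flag bits are set and the remaining index
is 1, which would fit a non-errno error such as the cancel described
above.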
Did you take a look at debugging your netpipe failure testcase?
That seems like the lowest level where we can figure out what is
going wrong. You should not be getting timeouts at all, and
appearances point to messages getting lost in the network somehow.
I cannot get your testcase to fail here, after over 72 hours of
continuous testing.
Also, did you have a chance to run the network debugging tool I sent
you offline? Both of those last mails from me should have reached you
on Monday last week.
You really can't expect to get the full system working until you fix
the basic failure.
-- Pete