Hal Rosenstock wrote:
hoq is HOQLife. Is slv the switch LifeTimeValue ?
I believe so.
Does that have anything to do with those settings ?
it would not work until hoq and slv were 17.
Truly hanging ?
yes, and it was the only real connection at that point, from the bproc daemon on the slave node to the bproc daemon on the master. There was only 1 host powered up at that point. It was very repeatable -- we tried to get it to boot many times. And, weirdly, it always hung at that same point.
Switches might drop 64 bytes at a time based on those parameters.
But why does the sender think the segment has been acked, when the receiver has never seen that last 64 bytes? Where did the sender get that TCP-level ack?
That effectively doubles the time before the drops would occur, which probably eliminated the drops so you didn't see this. 16 = 268.435 msec, 17 = 536.871 msec
which leads to another question. This is 1/2 second. Does it really mean that you could end up buffering 1/2 second's worth of flow on each port for all 256 ports?
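For reference, the msec figures for these settings come from the InfiniBand encoding in which a lifetime value n stands for 4.096 usec * 2^n. A minimal sketch of that arithmetic follows; the function name `lifetime_msec` is just an illustration, not part of any tool mentioned in this thread.

```python
# Lifetime encoding used for HOQLife / switch LifeTimeValue:
#   time = 4.096 usec * 2^value
BASE_USEC = 4.096

def lifetime_msec(value):
    """Convert an encoded lifetime value to milliseconds."""
    return BASE_USEC * (2 ** value) / 1000.0

for v in (16, 17):
    print(f"{v} = {lifetime_msec(v):.3f} msec")
# prints:
# 16 = 268.435 msec
# 17 = 536.871 msec
```

Each step up in the value doubles the timeout, so going from 16 to 17 moves it from roughly 1/4 second to roughly 1/2 second.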
What doesn't make sense to me is the one flow. Are you sure there's no other data traffic? If so, that doesn't make sense to me and doesn't hang together with the rest of this scenario.
no other traffic that we could see, but there had been traffic prior to this.
Thanks hal!
ron
