Hi Ron,

On Wed, 2006-04-12 at 19:29, Ronald G Minnich wrote:
> I was working with someone and watching a 256-node bproc cluster boot
> friday. The openib folks have done a lot of very nice work. It booted
> quite well once we set hoq and slv to 17 in the voltaire switch.

hoq is HOQLife. Is slv the switch LifeTimeValue?

> It was
> really snappy coming up. It was actually as fast to boot as a myrinet
> cluster, which was nice to see.

Does that have anything to do with those settings?

> But a question. When hoq and slv were 16 in the voltaire switch, we saw
> tcp sessions hanging.

Truly hanging?

> Thinking back on the tcpdump we watched (would
> that i had saved it) it almost seems that the sender thought it had
> gotten an ack for a segment of 96 bytes, and discarded it; whereas the
> receiver thought it had only gotten 32 of the 96 bytes, and was sending
> back its idea of where the tcp stream was.

Switches might drop 64 bytes at a time based on those parameters.

> So we sat and watched (via
> tcpdump on the receiver) the two hosts send each other differing ideas
> about the sequence numbers on the tcp connection.
>
> is this at all possible? Could something happen below the tcp stack,
> given a switch with too-low hoq and slv settings, such that the sender
> would discard a segment that the receiver would not have ever seen?

Yes. The two directions are independent, so I think that dropping in
one direction could cause this.

> Is
> there any switch involvement that could cause this? The whole situation
> was really odd.
>
> Finally, this was one sender, one receiver, and the problem was very,
> very repeatable -- until we bumped 16->17.

That effectively doubles the time before the drops would occur, which
probably eliminated the drops, so you didn't see this:

16 = 268.435 msec
17 = 536.871 msec

What doesn't make sense to me is the one flow. Are you sure there's no
other data traffic? If so, that doesn't make sense to me and doesn't
hang together with the rest of this scenario.

-- Hal

> Sorry I don't have more info.
>
> thanks
>
> ron
> _______________________________________________
> openib-general mailing list
> [email protected]
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
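
A note on the timeout figures for settings 16 and 17 discussed in the reply: they appear to follow the usual InfiniBand encoding, in which HOQLife and the switch LifeTimeValue are exponents and the timeout is 4.096 usec * 2^value. The sketch below reproduces that arithmetic under that assumption. The function name is hypothetical, and treating values above 19 as "no timeout" is also an assumption, not something stated in the thread.

```python
# Sketch only: assumes timeout = 4.096 usec * 2^value for HOQLife / LifeTimeValue.
BASE_USEC = 4.096  # assumed base unit of the encoding, in microseconds


def timeout_msec(value):
    """Return the timeout in milliseconds for an encoded value (hypothetical helper)."""
    if value > 19:
        # Assumption: values above 19 mean the packet lifetime is unlimited.
        return float("inf")
    return BASE_USEC * (2 ** value) / 1000.0


if __name__ == "__main__":
    for v in (16, 17):
        print(f"{v} = {timeout_msec(v):.3f} msec")
```

Each step of the exponent doubles the timeout, which is why going from 16 to 17 roughly doubles the time a packet may wait before the switch drops it.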
