On Tue, Apr 14, 2009 at 01:18:14PM -0700, Steven Dake wrote:
> > . message latency, if that's even possible
>
> Reducing the time a token is held reduces latency.  So the memcpy and
> malloc specials do reduce latency.  I don't have measures of how much,
> however.

That would be interesting to measure along with throughput; it's much
more relevant for applications doing coordination or locking via
messages.
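Something like the sketch below would be a starting point: it times how
long a self-sent message takes to come back through agreed ordering,
which is exactly the token-circulation latency we care about.  This is
rough and assumes the trunk CPG API (cpg_initialize, cpg_join,
cpg_mcast_joined, cpg_dispatch); the group name and iteration count are
arbitrary, and a real test would also match the sender's nodeid/pid in
the deliver callback.

  /* cpg_lat.c: rough sketch for measuring self-delivery latency
   * through totem's agreed ordering.  Build with the corosync devel
   * headers installed:
   *   gcc cpg_lat.c -o cpg_lat -lcpg -lrt
   */
  #include <stdio.h>
  #include <string.h>
  #include <time.h>
  #include <sys/uio.h>
  #include <corosync/cpg.h>

  static int got_msg;
  static struct timespec t_recv;

  /* called by cpg_dispatch() for each delivered message; a real test
   * would check that nodeid/pid match the local sender */
  static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                         uint32_t nodeid, uint32_t pid,
                         void *msg, size_t msg_len)
  {
          clock_gettime(CLOCK_MONOTONIC, &t_recv);
          got_msg = 1;
  }

  /* no-op; just here so dispatch has something to call on membership
   * changes */
  static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                         const struct cpg_address *members, size_t n_members,
                         const struct cpg_address *left, size_t n_left,
                         const struct cpg_address *joined, size_t n_joined)
  {
  }

  int main(void)
  {
          cpg_callbacks_t cb = {
                  .cpg_deliver_fn = deliver_cb,
                  .cpg_confchg_fn = confchg_cb,
          };
          cpg_handle_t handle;
          struct cpg_name group;
          struct timespec t_send;
          struct iovec iov;
          int i;

          strcpy(group.value, "lat-test");   /* arbitrary group name */
          group.length = strlen(group.value);

          if (cpg_initialize(&handle, &cb) != CS_OK ||
              cpg_join(handle, &group) != CS_OK) {
                  fprintf(stderr, "cpg setup failed; is corosync running?\n");
                  return 1;
          }

          for (i = 0; i < 100; i++) {
                  /* send the timestamp itself as the payload */
                  clock_gettime(CLOCK_MONOTONIC, &t_send);
                  iov.iov_base = &t_send;
                  iov.iov_len = sizeof(t_send);
                  cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

                  got_msg = 0;
                  while (!got_msg)
                          cpg_dispatch(handle, CS_DISPATCH_ONE); /* block for
                                                                    one event */

                  printf("round %3d: %ld usec\n", i,
                         (t_recv.tv_sec - t_send.tv_sec) * 1000000L +
                         (t_recv.tv_nsec - t_send.tv_nsec) / 1000);
          }

          cpg_finalize(handle);
          return 0;
  }

Running a copy on each node, with and without background load, would give
latency numbers to put next to the throughput ones; the same loop with
larger payloads covers the throughput side.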
> > . recovery speed, this seems to be getting worse, things often hang for up
> > to many seconds when nodes join or leave these days
>
> With trunk you see 2 second lags?  I agree that the recovery engine
> needs work to allow background synchronization of data sets without
> blocking the entire cluster operation during the synchronization period.

I'll try to measure it; my impression has been that it can be much longer
than 2 seconds, but I've not been paying close attention.

> > . stability with much shorter token timeouts, we currently use 10 seconds
> > as default, and I know corosync should work well with something much
> > shorter, it "just" needs testing/validation along with some diagnostic
> > methods to figure out when you're using something too short
>
> yes with 16 nodes it should work with 100msec timeouts as long as the
> kernel doesn't take long locks.  I think all those bugs are fixed in
> dlm/gfs now however.  The 10 seconds in cluster 2 was to work around
> those essentially "kernel lockups".

There were a couple of spots in the kernel that were quickly fixed by
calling schedule; they were never a big problem.  IIRC the timeout was
increased to 10 seconds because certain drivers or NICs were doing resets
which would stall network I/O.  We were worried that would be common in
user environments, but I doubt it.

> Perhaps we can get some QE effort around identifying a shorter default.

We need to choose something that works in our testing first.  I think we
should go ahead and change it to something like 2 seconds by default, and
see what happens (a sketch of the corresponding config change is in the
P.S. below).

> > . stability with clusters up to 32 nodes, with diagnostic capabilities to
> > immediately pinpoint the cause of a breakdown
>
> This is a great idea, but the diags to pinpoint the cause are very
> difficult.  I don't have a clear picture of how they would be designed
> but we have kicked around some ideas.

Yes, this will require some careful analysis, both within corosync and the
networking layers... and extended access to 16-32 nodes.

Dave
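P.S. For concreteness, the 2 second experiment is just the totem token
timeout in corosync.conf, something like the sketch below.  The interface
values are placeholders, and the retransmit count would need retesting
alongside the shorter timeout rather than taken as given here:

  totem {
          version: 2
          # token timeout in milliseconds; the proposed shorter default
          token: 2000
          # retransmits before the token is declared lost; needs tuning
          # together with the shorter timeout
          token_retransmits_before_loss_const: 10
          interface {
                  ringnumber: 0
                  bindnetaddr: 192.168.1.0    # placeholder network
                  mcastaddr: 226.94.1.1       # placeholder mcast group
                  mcastport: 5405
          }
  }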