On Mon, Nov 16, 2015 at 9:33 AM Andrew Wilkins <[email protected]> wrote:
> On Tue, Nov 10, 2015 at 12:18 PM Tim Penhey <[email protected]> > wrote: > >> Hi folks, >> >> After spending three hours on this and getting nowhere, I feel it is >> time for me to throw in the towel and let someone else who knows more >> about the peergrouper pick this up. >> >> One of the changes that golang 1.5 made was to change the default for >> GOMAXPROCS [1]. This causes new fun races in our tests. >> >> A key problem with the peergrouper is around assumptions that it knows >> that is going on :-) >> >> The peergrouper has a main loop that starts a watcher for state server >> changes. >> >> The state server changes watcher gets notified that there are some known >> state servers. This then signals to the primary peergrouper through a >> function channel. The peergrouper main loop then runs this function. >> >> The state server notify function then tries to start machine watchers >> for each of the machines that are currently state servers. >> >> Each of these machine watchers starts its own goroutine to watch for >> changes, and when it gets some, passes a refresh function to to the >> peergrouper main loop using the same notify function channel. >> >> Now there are a few races here: >> 1) when the peergrouper main loop gets a notify function, it resets the >> timer with 0, which should immediately signal the timer channel. It is >> possible that this timer channel will fire before the machine refresh >> notify functions are called. I attempted to fix this by changing the >> Reset to use time.Millisecond. >> 2) When the timer is signaled, it calls desiredPeerGroup. This however >> fails if it has machines with a nil Vote, in the extras (which are >> machines that aren't part of the current replicaset). This happens when >> the timer is signaled before the machine loops have updated themselves. >> >> I attempted to fix some of these in a branch off master [2], however the >> tests still fail in other places under load. Like here: >> >> worker_test.go:268: >> c.Fatalf("timed out waiting for vote to be set") >> ... Error: timed out waiting for vote to be set >> >> But the underlying problem was: >> >> [LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop >> terminated: cannot compute desired peer group: voting non-machine member >> replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234", >> Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), >> Priority:(*float64)(nil), >> Tags:map[string]string{"juju-machine-id":"11"}, >> SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group >> worker_test.go:271: >> >> Which comes from the desiredPeerGroup function when there is a nil Votes >> in an extra machine (which you can see is the case). >> >> However I don't grok why this is happening in the test. Full test output >> here: http://paste.ubuntu.com/13213601/ >> >> There are definitely races here, but for some of the failures, I can't >> tell if it is the peergrouper itself, or the complex mocks being used. >> >> So... I'm hoping someone who groks this more will pick up where I've >> left off. >> > > See below for what not to do with timers. > > I spent some time on Friday afternoon looking at this, and Tim has > verified that it fixes the issue for him. The tests were flaky because of > how timers were started/stopped/selected-on: > > > https://github.com/axw/juju/commit/4d1099ccd9b459e3da94c6595e2a7b5e945c3f87#diff-0fdda5d614954ebabbd3194ed9ef7f20L162 > Sorry, I should have actually explained rather than just pointed at code. time.Timer.Stop will undo an already-triggered timer event. So if you do this: t := time.NewTimer(0) t.Stop() then the t.C channel may or may not have a value in it. The Stop() method returns a boolean that tells you the answer to that question: https://godoc.org/time#Timer.Stop. My solution was just to do what we do elsewhere, and start out with a nil channel and only set it to something when we want to wait on a time event. Cheers, > Andrew >
-- Juju-dev mailing list [email protected] Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
