See also https://github.com/golang/go/issues/11513
On 16 November 2015 at 01:37, Andrew Wilkins <[email protected]> wrote: > On Mon, Nov 16, 2015 at 9:33 AM Andrew Wilkins > <[email protected]> wrote: >> >> On Tue, Nov 10, 2015 at 12:18 PM Tim Penhey <[email protected]> >> wrote: >>> >>> Hi folks, >>> >>> After spending three hours on this and getting nowhere, I feel it is >>> time for me to throw in the towel and let someone else who knows more >>> about the peergrouper pick this up. >>> >>> One of the changes that golang 1.5 made was to change the default for >>> GOMAXPROCS [1]. This causes new fun races in our tests. >>> >>> A key problem with the peergrouper is around assumptions that it knows >>> that is going on :-) >>> >>> The peergrouper has a main loop that starts a watcher for state server >>> changes. >>> >>> The state server changes watcher gets notified that there are some known >>> state servers. This then signals to the primary peergrouper through a >>> function channel. The peergrouper main loop then runs this function. >>> >>> The state server notify function then tries to start machine watchers >>> for each of the machines that are currently state servers. >>> >>> Each of these machine watchers starts its own goroutine to watch for >>> changes, and when it gets some, passes a refresh function to to the >>> peergrouper main loop using the same notify function channel. >>> >>> Now there are a few races here: >>> 1) when the peergrouper main loop gets a notify function, it resets the >>> timer with 0, which should immediately signal the timer channel. It is >>> possible that this timer channel will fire before the machine refresh >>> notify functions are called. I attempted to fix this by changing the >>> Reset to use time.Millisecond. >>> 2) When the timer is signaled, it calls desiredPeerGroup. This however >>> fails if it has machines with a nil Vote, in the extras (which are >>> machines that aren't part of the current replicaset). This happens when >>> the timer is signaled before the machine loops have updated themselves. >>> >>> I attempted to fix some of these in a branch off master [2], however the >>> tests still fail in other places under load. Like here: >>> >>> worker_test.go:268: >>> c.Fatalf("timed out waiting for vote to be set") >>> ... Error: timed out waiting for vote to be set >>> >>> But the underlying problem was: >>> >>> [LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop >>> terminated: cannot compute desired peer group: voting non-machine member >>> replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234", >>> Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), >>> Priority:(*float64)(nil), >>> Tags:map[string]string{"juju-machine-id":"11"}, >>> SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group >>> worker_test.go:271: >>> >>> Which comes from the desiredPeerGroup function when there is a nil Votes >>> in an extra machine (which you can see is the case). >>> >>> However I don't grok why this is happening in the test. Full test output >>> here: http://paste.ubuntu.com/13213601/ >>> >>> There are definitely races here, but for some of the failures, I can't >>> tell if it is the peergrouper itself, or the complex mocks being used. >>> >>> So... I'm hoping someone who groks this more will pick up where I've >>> left off. >> >> >> See below for what not to do with timers. >> >> I spent some time on Friday afternoon looking at this, and Tim has >> verified that it fixes the issue for him. The tests were flaky because of >> how timers were started/stopped/selected-on: >> >> >> https://github.com/axw/juju/commit/4d1099ccd9b459e3da94c6595e2a7b5e945c3f87#diff-0fdda5d614954ebabbd3194ed9ef7f20L162 > > > Sorry, I should have actually explained rather than just pointed at code. > > time.Timer.Stop will undo an already-triggered timer event. So if you do > this: > t := time.NewTimer(0) > t.Stop() > then the t.C channel may or may not have a value in it. The Stop() method > returns a boolean that tells you the answer to that question: > https://godoc.org/time#Timer.Stop. > > My solution was just to do what we do elsewhere, and start out with a nil > channel and only set it to something when we want to wait on a time event. > >> Cheers, >> Andrew > > > -- > Juju-dev mailing list > [email protected] > Modify settings or unsubscribe at: > https://lists.ubuntu.com/mailman/listinfo/juju-dev > -- Juju-dev mailing list [email protected] Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
