Thanks Dan and Matt!

On Fri, May 19, 2017 at 2:48 PM, Matt Riedemann <[email protected]> wrote:
> FYI
>
> -------- Forwarded Message --------
> Subject: [openstack-dev] [nova] Boston Forum session recap - cellsv2
> Date: Fri, 19 May 2017 08:13:24 -0700
> From: Dan Smith <[email protected]>
> Reply-To: OpenStack Development Mailing List (not for usage questions) <[email protected]>
> To: OpenStack Development Mailing List (not for usage questions) <[email protected]>
>
> The etherpad for this session is here [1]. The goal of the session was
> to get some questions answered that the developers had for operators
> around the topic of cellsv2.
>
> The bulk of the time was spent discussing ways to limit instance
> scheduling retries in a cellsv2 world where placement eliminates
> resource-reservation races. Reschedules would be upcalls from the cell,
> which we are trying to avoid.
>
> While placement should eliminate 95% (or more) of reschedules by
> pre-claiming resources before booting, there will still be cases where
> we may want to reschedule due to unexpected transient failures. How many
> of those remain, and whether rescheduling for them is really useful,
> is an open question.
>
> The compromise that seemed popular in the room was to grab more than one
> host at scheduling time, claim on one of them, and pass the rest to the
> cell. If the cell needs to reschedule, the cell conductor would try one
> of the alternates that came as part of the original boot request instead
> of asking the scheduler again.
>
> During this discussion, an operator raised the concern that without
> reschedules, a single compute that fails to boot 100% of the time ends
> up becoming a magnet for all future builds: it looks like an excellent
> target to the scheduler, but fails anything that is sent to it. If we
> don't reschedule, that situation could be very problematic. An idea came
> out that a compute should really monitor and disable itself if a certain
> number of _consecutive_ build failures crosses a threshold.
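Stepping back to the alternate-hosts compromise described above, the cell-side retry loop could be sketched roughly as follows. All names here are purely illustrative (this is not Nova's actual API); the point is only that the conductor walks a pre-selected list instead of making an upcall to the scheduler:

```python
# Sketch of the "alternate hosts" compromise: the scheduler hands down a
# claimed primary host plus unclaimed alternates with the boot request,
# and the cell conductor retries locally on failure. Hypothetical names,
# not Nova's real implementation.

class BuildFailure(Exception):
    """Raised when a compute host fails to build an instance."""

def build_with_alternates(instance, hosts, try_build):
    """Try the pre-selected primary host, then each alternate in turn.

    `hosts` is the ordered list that came down with the boot request;
    `try_build` attempts the build on one host and raises BuildFailure
    on a transient error.
    """
    errors = []
    for host in hosts:
        try:
            return try_build(instance, host)
        except BuildFailure as exc:
            # No upcall to the scheduler: just move on to the next
            # alternate that came with the original request.
            errors.append((host, exc))
    raise BuildFailure(f"all {len(hosts)} candidate hosts failed: {errors}")
```

The trade-off is that the alternates are not claimed in placement up front, so a retry can still lose a race on a busy alternate; the bet is that this is rare enough to be acceptable.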
> That would mitigate or eliminate the "fail magnet" behavior and further
> reduce the need for retries. A patch has been proposed for this, and so
> far it enjoys wide support [2].
>
> We also discussed the transition to counting quotas, and what that means
> for operators. The room seemed in favor of this, and discussion was brief.
>
> Finally, I made the call for people with reasonably-sized pre-prod
> environments to begin testing cellsv2 to help prove it out and find the
> gremlins. CERN and NeCTAR specifically volunteered for this effort.
>
> [1] https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
> [2] https://review.openstack.org/#/c/463597/
>
> --Dan
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> _______________________________________________
> OpenStack-operators mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
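The self-disable idea in the recap (the approach proposed in [2]) amounts to a small state machine per compute: count consecutive build failures, reset on success, and take the host out of scheduling once a threshold is crossed. A rough sketch, with illustrative names and an assumed example threshold rather than Nova's actual code:

```python
# Sketch of the "disable after N consecutive build failures" idea.
# The class name, method names, and threshold value are hypothetical.

CONSECUTIVE_BUILD_FAILURE_THRESHOLD = 10  # assumed example value

class ComputeBuildMonitor:
    """Counts consecutive build failures; any success resets the count."""

    def __init__(self, threshold=CONSECUTIVE_BUILD_FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.disabled = False

    def record_success(self):
        # One good build proves the host is healthy again.
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Take the host out of scheduling so it stops acting as a
            # "fail magnet" for new builds.
            self.disabled = True
```

Counting *consecutive* failures (rather than a failure rate) is what keeps a healthy-but-occasionally-unlucky host in rotation while still catching a host that fails every build.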
