Thanks Dan and Matt!

On Fri, May 19, 2017 at 2:48 PM, Matt Riedemann <[email protected]> wrote:

> FYI
>
>
>
> -------- Forwarded Message --------
> Subject: [openstack-dev] [nova] Boston Forum session recap - cellsv2
> Date: Fri, 19 May 2017 08:13:24 -0700
> From: Dan Smith <[email protected]>
> Reply-To: OpenStack Development Mailing List (not for usage questions) <[email protected]>
> To: OpenStack Development Mailing List (not for usage questions) <[email protected]>
>
> The etherpad for this session is here [1]. The goal of the session was
> to get some questions answered that the developers had for operators
> around the topic of cellsv2.
>
> The bulk of the time was spent discussing ways to limit instance
> scheduling retries in a cellsv2 world where placement eliminates
> resource-reservation races. Reschedules would be upcalls from the cell,
> which we are trying to avoid.
>
> While placement should eliminate 95% (or more) of reschedules due to
> pre-claiming resources before booting, there will still be cases where
> we may want to reschedule due to unexpected transient failures. How many
> of those remain, and whether rescheduling for them is really useful, is
> an open question.
>
> The compromise that seemed popular in the room was to grab more than one
> host at the time of scheduling, claim resources on the selected one, and
> pass the rest to the cell. If the cell needs to reschedule, the cell
> conductor would try one of the alternates that came as part of the
> original boot request, instead of asking the scheduler again.
>
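A rough way to picture the alternates idea above, as a minimal sketch only. The names here (BootRequest, BuildFailed, build_with_alternates) are hypothetical and are not Nova APIs; how resources would be claimed on an alternate is deliberately left out, since the recap does not say.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class BuildFailed(Exception):
    """Raised by the (hypothetical) build callable when a host cannot boot."""


@dataclass
class BootRequest:
    # Host the scheduler selected and claimed resources on.
    selected_host: str
    # Extra candidates returned by the same scheduling call (assumed shape).
    alternates: List[str] = field(default_factory=list)


def build_with_alternates(
    request: BootRequest, build: Callable[[str], None]
) -> Optional[str]:
    """Try the selected host, then each alternate in order.

    The scheduler is never called again, so no upcall leaves the cell.
    Returns the host that booted the instance, or None if all failed.
    """
    for host in [request.selected_host, *request.alternates]:
        try:
            build(host)
            return host
        except BuildFailed:
            # Fall through to the next alternate carried in the request.
            continue
    return None
```

The point of the design is that the retry list travels with the boot request, so the cell conductor has everything it needs locally.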
> During the discussion of this, an operator raised the concern that
> without reschedules, a single compute that fails to boot 100% of the
> time ends up becoming a magnet for all future builds, looking like an
> excellent target for the scheduler, but failing anything that is sent to
> it. If we don't reschedule, that situation could be very problematic. An
> idea came out that each compute should monitor itself and disable itself
> once the number of _consecutive_ build failures crosses a threshold. That
> would mitigate/eliminate the "fail magnet" behavior and further reduce
> the need for retries. A patch has been proposed for this, and so far
> enjoys wide support [2].
>
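The self-disable idea above could look roughly like the following. This is only a sketch under assumptions, not the proposed patch in [2]: the class and method names are hypothetical, and treating "crosses a threshold" as "reaches or exceeds" is my reading.

```python
class BuildFailureTracker:
    """Counts consecutive build failures on one compute host (hypothetical)."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0
        # True while the host should keep receiving new builds.
        self.enabled = True

    def record_success(self) -> None:
        # Any successful build breaks the streak, so transient errors
        # never add up to a disable.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        # Assumption: disable as soon as the streak reaches the threshold.
        if self.consecutive_failures >= self.threshold:
            self.enabled = False
```

Counting only consecutive failures is what separates a broken host from one that hits an occasional transient error. Re-enabling the host is left to the operator in this sketch.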
> We also discussed the transition to counting quotas, and what that means
> for operators. The room seemed in favor of this, and discussion was brief.
>
> Finally, I made the call for people with reasonably-sized pre-prod
> environments to begin testing cellsv2 to help prove it out and find the
> gremlins. CERN and NeCTAR specifically volunteered for this effort.
>
> [1] https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
> [2] https://review.openstack.org/#/c/463597/
>
> --Dan
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> _______________________________________________
> OpenStack-operators mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>