Anton,

Thanks for your effort on reducing the PME blocking time. I'm looking at it.
Thu, Dec 5, 2019 at 16:14, Anton Vinogradov <a...@apache.org>:

> Igniters,
>
> The solution is ready to be reviewed.
> TeamCity is green:
> https://issues.apache.org/jira/browse/IGNITE-9913?focusedCommentId=16988566&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16988566
> Issue: https://issues.apache.org/jira/browse/IGNITE-9913
> Relevant PR: https://github.com/apache/ignite/pull/7069
>
> In brief: the solution avoids both PME and waiting for running transactions on node left.
>
> What was done/changed:
> Extended the Late Affinity Assignment semantics: from "guarantee to switch when ideal primaries are rebalanced" to "guarantee to switch when all expected owners are rebalanced".
> This extension provides the ability to perform a PME-free switch on baseline node left when the cluster is fully rebalanced, avoiding a lot of corner cases possible on a partially rebalanced cluster.
>
> The PME-free switch does not block, cancel, or wait for current operations.
> It just recovers failed primaries, and once cluster-wide recovery is finished it completes the exchange future (allowing new operations).
> So, in other words, new-topology operations are now allowed immediately after cluster-wide recovery finishes (which is fast), not after previous-topology operations finish + recovery + PME, as it was before.
>
> Limitations:
> The optimization works only when the baseline is set (alive primaries require no relocation; only recovery is required for failed primaries).
>
> BTW, the PME-free code itself is small and simple; most of the changes are related to the "rebalanced cluster" guarantee.
>
> It would be nice if someone could check the code and tests.
> Test coverage proposals are welcome!
>
> P.s. Here's the benchmark used to check that we get the performance improvements listed above:
> https://github.com/anton-vinogradov/ignite/blob/426f0de9185ac2938dadb7bd2cbfe488233fe7d6/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/Bench.java
>
> P.p.s. We also checked the same case in a real environment and found no latency degradation.
>
> On Thu, Sep 19, 2019 at 2:18 PM Anton Vinogradov <a...@apache.org> wrote:
>
> > Ivan,
> >
> > A useful profile can be produced only by a reproducible benchmark.
> > We're working on such a benchmark to analyze the duration and measure the improvement.
> >
> > Currently, only the typical overall duration from prod servers is known.
> >
> > On Thu, Sep 19, 2019 at 12:09 PM Павлухин Иван <vololo...@gmail.com> wrote:
> >
> >> Anton, folks,
> >>
> >> Out of curiosity, do we have any measurements of PME execution in an environment similar to a real-world one?
> >> In my mind, some kind of "profile" showing the durations of the different PME phases, with an indication of where we are "blocked", would be ideal.
> >>
> >> Wed, Sep 18, 2019 at 13:27, Anton Vinogradov <a...@apache.org>:
> >> >
> >> > Alexey,
> >> >
> >> > >> Can you describe the complete protocol changes first
> >> > The current goal is to find a better way.
> >> > I had at least 5 scenarios discarded because of corner cases found in them (thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
> >> > That's why I explained what we are able to improve and why I think it works.
> >> >
> >> > >> we need to remove this property, not add new logic that relies on it.
> >> > Agree.
> >> >
> >> > >> How are you going to synchronize this?
> >> > Thanks for providing this case; it seems it discards the #1 + #2.2 combination, while #2.1 is still possible with some optimizations.
> >> >
> >> > The "zombie eating transactions" case can theoretically be solved, I think.
> >> > As I said before, we may perform a "Local switch" in case the affinity was not changed (except for the failed node miss); other cases require a regular PME.
> >> > In this case, the new primary is an ex-backup, and we expect that the old primary will try to update the new primary as a backup.
> >> > The new primary will handle operations as a backup until it is notified that it is the primary now.
> >> > Operations from the ex-primary will be discarded or remapped once the new primary is notified that it became the primary.
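
Just to check that I read the "Local switch" promotion above correctly, here is a minimal sketch of the rule as I understand it (illustrative names only, not the actual exchange code). It assumes the owner list of a partition is taken primary-first, the way Affinity#mapPartitionToPrimaryAndBackups returns it; the first surviving owner becomes the new primary, and if no owner survived a regular PME is unavoidable.

    import java.util.Collection;
    import java.util.UUID;

    import org.apache.ignite.cluster.ClusterNode;

    /** Illustration only: who becomes the primary of one partition after a node failure. */
    public class LocalSwitchSketch {
        /**
         * @param owners Partition owners, primary first (e.g. what Affinity#mapPartitionToPrimaryAndBackups returns).
         * @param failedNodeId Id of the failed node.
         * @return New primary, or null if no owner survived (then a regular PME is required).
         */
        public static ClusterNode newPrimary(Collection<ClusterNode> owners, UUID failedNodeId) {
            for (ClusterNode owner : owners) {
                if (!owner.id().equals(failedNodeId))
                    return owner; // First surviving owner in affinity order keeps or takes the primary role.
            }

            return null; // All owners lost: no local switch is possible for this partition.
        }
    }

If the failed node was not the primary of the partition, the same loop simply returns the current primary and nothing changes, which matches the "affinity not changed except for the failed node miss" condition.
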
> >> > Discarding is a big improvement,
> >> > remapping is a huge improvement,
> >> > but there is no 100% guarantee that the ex-primary will try to update the new primary as a backup.
> >> >
> >> > There are a lot of corner cases here.
> >> > So it seems minimized sync is a better solution.
> >> >
> >> > Finally, according to your and Alexey Scherbakov's comments, the better option is just to improve PME to wait for less, at least for now.
> >> > It seems we have to wait for (or cancel, I vote for this option; any objections?) operations related to the failed primaries and wait for recovery to finish (which is fast).
> >> > In case the affinity was changed and a backup-primary switch (not related to the failed primaries) is required between the owners, or even a rebalance is required, we should just ignore this and perform only a "Recovery PME".
> >> > A regular PME should happen later (if necessary); it can even be delayed (if necessary).
> >> >
> >> > Sounds good?
> >> >
> >> > On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >> >
> >> > > Anton,
> >> > >
> >> > > I looked through the presentation and the ticket but did not find a description of the new protocol that you are going to implement.
> >> > > Can you describe the complete protocol changes first (to save time both for you and during the later review)?
> >> > >
> >> > > Some questions that I have in mind:
> >> > > * It looks like for the "Local Switch" optimization you assume that a node failure is observed immediately by the whole cluster.
> >> > > This is not true: some nodes may "see" node A's failure while others still consider it alive.
> >> > > Moreover, node A may not yet know that it is about to be failed and may keep processing requests, for example due to a long GC pause on node A.
> >> > > In this case, node A resumes its execution and proceeds to work as a primary (it has not received a segmentation event yet), and node B has not received the A FAILED event yet either.
> >> > > Node C received the event, ran the "Local switch" and assumed the primary role; node D also received the A FAILED event and switched to the new primary.
> >> > > Transactions from nodes B and D will be processed simultaneously on different primaries. How are you going to synchronize this?
> >> > > * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this property, not add new logic that relies on it.
> >> > > There is no way a user can calculate this property or adjust it at runtime.
> >> > > If a user decreases this property below a safe value, we will get inconsistent update counters and cluster desync.
> >> > > Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large value and will push the HWM forward very quickly, much faster than during regular updates (you will have to increment it for each partition).
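
To make the update-counter concern concrete, here is a toy model of the HWM jump discussed in this thread (deliberately not Ignite's internal partition counter; the class and method names are illustrative):

    /** Toy model of a per-partition update counter, only to illustrate the HWM-jump idea. */
    public class UpdateCounterSketch {
        /** Highest update counter this node has handed out or observed for the partition. */
        private long hwm;

        /** Assigns the next update counter for a primary-side update. */
        public synchronized long next() {
            return ++hwm;
        }

        /**
         * What the async-recovery idea (#2.2 in the original proposal quoted below) would do on the
         * new primary right after the switch: jump the counter by a configured gap instead of waiting
         * for recovery to learn the exact HWM of the dead primary. If the dead primary had already
         * assigned more than the gap worth of counters that this ex-backup never saw, the two ranges
         * overlap, which is exactly the inconsistent-counters / desync scenario described above.
         */
        public synchronized void reserveAfterFailover(long gap) {
            hwm += gap; // gap ~ IGNITE_MAX_COMPLETED_TX_COUNT
        }
    }

In other words, the safety of the async variant hinges on a gap value that a user can silently set too low.
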
> >> > >
> >> > > Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <a...@apache.org>:
> >> > >
> >> > > > Igniters,
> >> > > >
> >> > > > Recently we had a discussion devoted to the non-blocking PME.
> >> > > > We agreed that the most important case is blocking on node failure, and it can be split into:
> >> > > >
> >> > > > 1) The latency of operations on affected partitions will be increased by the node failure detection duration.
> >> > > > So, some operations may be frozen for 10+ seconds on real clusters, just waiting for a response from a failed primary.
> >> > > > In other words, some operations will be blocked even before the blocking PME starts.
> >> > > >
> >> > > > The good news here is that a bigger cluster decreases the percentage of blocked operations.
> >> > > >
> >> > > > The bad news is that these operations may block non-affected operations in:
> >> > > > - customer code (single-thread/striped pool usage)
> >> > > > - multikey operations (tx1 locked A and waits for failed B, non-affected tx2 waits for A)
> >> > > > - striped pools inside AI (when some task waits for tx.op() in a sync way and the striped thread is busy)
> >> > > > - etc.
> >> > > >
> >> > > > It seems that, thanks to StopNodeFailureHandler (if configured), we already always send a node-left event before the node stops, to minimize the waiting period.
> >> > > > So, only the cases that cause a hang without a stop are the problem now.
> >> > > >
> >> > > > Anyway, some additional research is required here, and it would be nice if someone were willing to help.
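
A side note on the StopNodeFailureHandler mention above: enabling it is a one-line configuration change. A minimal sketch follows; the timeout value is only an example:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeFailureHandler;

    public class FailureHandlerConfigSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                // Stop the node on a critical local failure, so (as noted above) peers get a
                // node-left event right away instead of waiting out the failure detection timeout.
                .setFailureHandler(new StopNodeFailureHandler())
                // Example value only: how long peers wait before considering a silent node failed.
                .setFailureDetectionTimeout(10_000);

            Ignite ignite = Ignition.start(cfg);

            // ... run the node as usual; the handler only matters when a critical failure happens.
            ignite.close();
        }
    }
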
> >> > > > 2) Some optimizations may speed up the node-left case (eliminate blocking of upcoming operations).
> >> > > > A full list can be found in the presentation [1].
> >> > > > The list contains 8 optimizations, but I propose to implement some in phase one and the rest in phase two.
> >> > > > Assuming that a real production deployment has Baseline enabled, we are able to gain a speed-up by implementing the following:
> >> > > >
> >> > > > #1 Switch on the node_fail/node_left event locally instead of starting a real PME (Local switch).
> >> > > > Since BLT is enabled, we are always able to switch to the new-affinity primaries (no need to preload partitions).
> >> > > > In case we're not able to switch to the new-affinity primaries (all missing or BLT disabled), we'll just start a regular PME.
> >> > > > The new-primary calculation can be performed locally or by the coordinator (e.g. attached to the node_fail message).
> >> > > >
> >> > > > #2 We should not wait for the completion of already started operations (since they are not related to the failed primary's partitions).
> >> > > > The only problem is recovery, which may cause update-counter duplications in case of an unsynced HWM.
> >> > > >
> >> > > > #2.1 We may wait only for recovery completion (Micro-blocking switch).
> >> > > > Just block (all of them, at this phase) upcoming operations during the recovery by incrementing the topology version.
> >> > > > So, in other words, it will be some kind of PME with waiting, but it will wait for recovery (fast) instead of for current operations to finish (long).
> >> > > >
> >> > > > #2.2 Recovery, theoretically, can be async.
> >> > > > We have to solve the unsynced HWM issue (to avoid concurrent usage of the same counters) to make it happen.
> >> > > > We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT at the new primary and continue recovery in an async way.
> >> > > > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed transactions we expect between "the first backup committed tx1" and "the last backup committed the same tx1".
> >> > > > I propose to use it to specify the number of prepared transactions we expect between "the first backup prepared tx1" and "the last backup prepared the same tx1".
> >> > > > Both cases look pretty similar.
> >> > > > In this case, we are able to make the switch fully non-blocking, with async recovery.
> >> > > > Thoughts?
> >> > > >
> >> > > > So, I'm going to implement both improvements under the "Lightweight version of partitions map exchange" issue [2] if no one minds.
> >> > > >
> >> > > > [1] https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
> >> > > > [2] https://issues.apache.org/jira/browse/IGNITE-9913
> >> > > >
> >>
> >> --
> >> Best regards,
> >> Ivan Pavlukhin
>

--
Best regards,
Alexei Scherbakov