Anton,

Thanks for your effort on reducing the PME blocking time. I'm looking at it.
Thu, Dec 5, 2019 at 16:14, Anton Vinogradov <a...@apache.org>:

> Igniters,
>
> The solution is ready to be reviewed.
> TeamCity is green:
> https://issues.apache.org/jira/browse/IGNITE-9913?focusedCommentId=16988566&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16988566
> Issue: https://issues.apache.org/jira/browse/IGNITE-9913
> Relevant PR: https://github.com/apache/ignite/pull/7069
>
> In brief: the solution avoids both PME and waiting for running transactions on node left.
>
> What was done/changed:
> Extended the Late Affinity Assignment semantics: from "guarantee to switch when ideal primaries are rebalanced" to "guarantee to switch when all expected owners are rebalanced".
> This extension provides the ability to perform a PME-free switch on baseline node left when the cluster is fully rebalanced, avoiding a lot of corner cases possible on a partially rebalanced cluster.
>
> The PME-free switch does not block, cancel, or wait for current operations.
> It just recovers failed primaries, and once cluster-wide recovery is finished it completes the exchange future (allowing new operations).
> So, in other words, new-topology operations are now allowed immediately after cluster-wide recovery finishes (which is fast), not after previous-topology operations finish + recovery + PME, as it was before.
>
> Limitations:
> The optimization works only when the baseline is set (alive primaries require no relocation; only recovery is required for failed primaries).
>
> BTW, the PME-free code itself is small and simple; most of the changes are related to the "rebalanced cluster" guarantee.
>
> It would be nice if someone could check the code and tests.
> Test coverage proposals are welcome!
>
> P.s. Here's the benchmark used to check that we get the performance improvements listed above:
> https://github.com/anton-vinogradov/ignite/blob/426f0de9185ac2938dadb7bd2cbfe488233fe7d6/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/Bench.java
>
> P.p.s. We also checked the same case in a real environment and found no latency degradation.
>
> On Thu, Sep 19, 2019 at 2:18 PM Anton Vinogradov <a...@apache.org> wrote:
>
> > Ivan,
> >
> > A useful profile can be produced only by a reproducible benchmark.
> > We're working on such a benchmark to analyze the duration and measure the improvement.
> >
> > Currently, only the typical overall duration from prod servers is known.
> >
> > On Thu, Sep 19, 2019 at 12:09 PM Павлухин Иван <vololo...@gmail.com> wrote:
> >
> >> Anton, folks,
> >>
> >> Out of curiosity, do we have any measurements of PME execution in an environment similar to a real-world one?
> >> In my mind, some kind of "profile" showing the durations of the different PME phases, with an indication of where we are "blocked", would be ideal.
> >>
> >> Wed, Sep 18, 2019 at 13:27, Anton Vinogradov <a...@apache.org>:
> >> >
> >> > Alexey,
> >> >
> >> > >> Can you describe the complete protocol changes first
> >> > The current goal is to find a better way.
> >> > I had at least 5 scenarios discarded because of corner cases found in them (thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
> >> > That's why I explained what we are able to improve and why I think it works.
> >> >
> >> > >> we need to remove this property, not add new logic that relies on it.
> >> > Agree.
> >> >
> >> > >> How are you going to synchronize this?
> >> > Thanks for providing this case; it seems it discards the #1 + #2.2 combination, while #2.1 is still possible with some optimizations.
> >> >
> >> > The "zombie eating transactions" case can theoretically be solved, I think.
> >> > As I said before, we may perform a "Local switch" in case the affinity was not changed (except for the failed node miss); other cases require a regular PME.
> >> > In this case, the new primary is an ex-backup, and we expect that the old primary will try to update the new primary as a backup.
> >> > The new primary will handle operations as a backup until it is notified that it is the primary now.
> >> > Operations from the ex-primary will be discarded or remapped once the new primary is notified that it became the primary.
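
Just to check that I read the "Local switch" promotion above correctly, here is a minimal sketch of the rule as I understand it (illustrative names only, not the actual exchange code). It assumes the owner list of a partition is taken primary-first, the way Affinity#mapPartitionToPrimaryAndBackups returns it; the first surviving owner becomes the new primary, and if no owner survived a regular PME is unavoidable.

    import java.util.Collection;
    import java.util.UUID;

    import org.apache.ignite.cluster.ClusterNode;

    /** Illustration only: who becomes the primary of one partition after a node failure. */
    public class LocalSwitchSketch {
        /**
         * @param owners Partition owners, primary first (e.g. what Affinity#mapPartitionToPrimaryAndBackups returns).
         * @param failedNodeId Id of the failed node.
         * @return New primary, or null if no owner survived (then a regular PME is required).
         */
        public static ClusterNode newPrimary(Collection<ClusterNode> owners, UUID failedNodeId) {
            for (ClusterNode owner : owners) {
                if (!owner.id().equals(failedNodeId))
                    return owner; // First surviving owner in affinity order keeps or takes the primary role.
            }

            return null; // All owners lost: no local switch is possible for this partition.
        }
    }

If the failed node was not the primary of the partition, the same loop simply returns the current primary and nothing changes, which matches the "affinity not changed except for the failed node miss" condition.
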
> >> > Discarding is a big improvement,
> >> > remapping is a huge improvement,
> >> > but there is no 100% guarantee that the ex-primary will try to update the new primary as a backup.
> >> >
> >> > There are a lot of corner cases here.
> >> > So it seems minimized sync is a better solution.
> >> >
> >> > Finally, according to your and Alexey Scherbakov's comments, the better option is just to improve PME to wait for less, at least for now.
> >> > It seems we have to wait for (or cancel, I vote for this option; any objections?) operations related to the failed primaries and wait for recovery to finish (which is fast).
> >> > In case the affinity was changed and a backup-primary switch (not related to the failed primaries) is required between the owners, or even a rebalance is required, we should just ignore this and perform only a "Recovery PME".
> >> > A regular PME should happen later (if necessary); it can even be delayed (if necessary).
> >> >
> >> > Sounds good?
> >> >
> >> > On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >> >
> >> > > Anton,
> >> > >
> >> > > I looked through the presentation and the ticket but did not find a description of the new protocol that you are going to implement.
> >> > > Can you describe the complete protocol changes first (to save time both for you and during the later review)?
> >> > >
> >> > > Some questions that I have in mind:
> >> > > * It looks like for the "Local Switch" optimization you assume that a node failure is observed immediately by the whole cluster.
> >> > > This is not true: some nodes may "see" node A's failure while others still consider it alive.
> >> > > Moreover, node A may not yet know that it is about to be failed and may keep processing requests, for example due to a long GC pause on node A.
> >> > > In this case, node A resumes its execution and proceeds to work as a primary (it has not received a segmentation event yet), and node B has not received the A FAILED event yet either.
> >> > > Node C received the event, ran the "Local switch" and assumed the primary role; node D also received the A FAILED event and switched to the new primary.
> >> > > Transactions from nodes B and D will be processed simultaneously on different primaries. How are you going to synchronize this?
> >> > > * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this property, not add new logic that relies on it.
> >> > > There is no way a user can calculate this property or adjust it at runtime.
> >> > > If a user decreases this property below a safe value, we will get inconsistent update counters and cluster desync.
> >> > > Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large value and will push the HWM forward very quickly, much faster than during regular updates (you will have to increment it for each partition).
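
To make the update-counter concern concrete, here is a toy model of the HWM jump discussed in this thread (deliberately not Ignite's internal partition counter; the class and method names are illustrative):

    /** Toy model of a per-partition update counter, only to illustrate the HWM-jump idea. */
    public class UpdateCounterSketch {
        /** Highest update counter this node has handed out or observed for the partition. */
        private long hwm;

        /** Assigns the next update counter for a primary-side update. */
        public synchronized long next() {
            return ++hwm;
        }

        /**
         * What the async-recovery idea (#2.2 in the original proposal quoted below) would do on the
         * new primary right after the switch: jump the counter by a configured gap instead of waiting
         * for recovery to learn the exact HWM of the dead primary. If the dead primary had already
         * assigned more than the gap worth of counters that this ex-backup never saw, the two ranges
         * overlap, which is exactly the inconsistent-counters / desync scenario described above.
         */
        public synchronized void reserveAfterFailover(long gap) {
            hwm += gap; // gap ~ IGNITE_MAX_COMPLETED_TX_COUNT
        }
    }

In other words, the safety of the async variant hinges on a gap value that a user can silently set too low.
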
> >> > >
> >> > > Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <a...@apache.org>:
> >> > >
> >> > > > Igniters,
> >> > > >
> >> > > > Recently we had a discussion devoted to the non-blocking PME.
> >> > > > We agreed that the most important case is blocking on node failure, and it can be split into:
> >> > > >
> >> > > > 1) The latency of operations on affected partitions will be increased by the node failure detection duration.
> >> > > > So, some operations may be frozen for 10+ seconds on real clusters, just waiting for a response from a failed primary.
> >> > > > In other words, some operations will be blocked even before the blocking PME starts.
> >> > > >
> >> > > > The good news here is that a bigger cluster decreases the percentage of blocked operations.
> >> > > >
> >> > > > The bad news is that these operations may block non-affected operations in:
> >> > > > - customer code (single-thread/striped pool usage)
> >> > > > - multikey operations (tx1 locked A and waits for failed B, non-affected tx2 waits for A)
> >> > > > - striped pools inside AI (when some task waits for tx.op() in a sync way and the striped thread is busy)
> >> > > > - etc.
> >> > > >
> >> > > > It seems that, thanks to StopNodeFailureHandler (if configured), we already always send a node-left event before the node stops, to minimize the waiting period.
> >> > > > So, only the cases that cause a hang without a stop are the problem now.
> >> > > >
> >> > > > Anyway, some additional research is required here, and it would be nice if someone were willing to help.
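
A side note on the StopNodeFailureHandler mention above: enabling it is a one-line configuration change. A minimal sketch follows; the timeout value is only an example:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeFailureHandler;

    public class FailureHandlerConfigSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                // Stop the node on a critical local failure, so (as noted above) peers get a
                // node-left event right away instead of waiting out the failure detection timeout.
                .setFailureHandler(new StopNodeFailureHandler())
                // Example value only: how long peers wait before considering a silent node failed.
                .setFailureDetectionTimeout(10_000);

            Ignite ignite = Ignition.start(cfg);

            // ... run the node as usual; the handler only matters when a critical failure happens.
            ignite.close();
        }
    }
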
> >> > > > 2) Some optimizations may speed up the node-left case (eliminate blocking of upcoming operations).
> >> > > > A full list can be found in the presentation [1].
> >> > > > The list contains 8 optimizations, but I propose to implement some in phase one and the rest in phase two.
> >> > > > Assuming that a real production deployment has Baseline enabled, we are able to gain a speed-up by implementing the following:
> >> > > >
> >> > > > #1 Switch on the node_fail/node_left event locally instead of starting a real PME (Local switch).
> >> > > > Since BLT is enabled, we are always able to switch to the new-affinity primaries (no need to preload partitions).
> >> > > > In case we're not able to switch to the new-affinity primaries (all missing or BLT disabled), we'll just start a regular PME.
> >> > > > The new-primary calculation can be performed locally or by the coordinator (e.g. attached to the node_fail message).
> >> > > >
> >> > > > #2 We should not wait for the completion of already started operations (since they are not related to the failed primary's partitions).
> >> > > > The only problem is recovery, which may cause update-counter duplications in case of an unsynced HWM.
> >> > > >
> >> > > > #2.1 We may wait only for recovery completion (Micro-blocking switch).
> >> > > > Just block (all of them, at this phase) upcoming operations during the recovery by incrementing the topology version.
> >> > > > So, in other words, it will be some kind of PME with waiting, but it will wait for recovery (fast) instead of for current operations to finish (long).
> >> > > >
> >> > > > #2.2 Recovery, theoretically, can be async.
> >> > > > We have to solve the unsynced HWM issue (to avoid concurrent usage of the same counters) to make it happen.
> >> > > > We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT at the new primary and continue recovery in an async way.
> >> > > > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed transactions we expect between "the first backup committed tx1" and "the last backup committed the same tx1".
> >> > > > I propose to use it to specify the number of prepared transactions we expect between "the first backup prepared tx1" and "the last backup prepared the same tx1".
> >> > > > Both cases look pretty similar.
> >> > > > In this case, we are able to make the switch fully non-blocking, with async recovery.
> >> > > > Thoughts?
> >> > > >
> >> > > > So, I'm going to implement both improvements under the "Lightweight version of partitions map exchange" issue [2] if no one minds.
> >> > > >
> >> > > > [1] https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
> >> > > > [2] https://issues.apache.org/jira/browse/IGNITE-9913
> >> > > >
> >>
> >> --
> >> Best regards,
> >> Ivan Pavlukhin
>

--
Best regards,
Alexei Scherbakov