Anton, folks,

Out of curiosity: do we have any measurements of PME execution in an environment similar to a real-world one? In my mind, some kind of "profile" showing the durations of the different PME phases, with an indication of where we are "blocked", would be ideal.
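Not actual Ignite code, just to illustrate what I mean by a "profile": a tiny hypothetical helper (all names are mine) that times named phases and prints them longest-first, so the phase we are blocked in stands out.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;

    /** Hypothetical sketch: record how long each PME phase took. */
    class PmePhaseProfile {
        private final Map<String, Long> durationsNs = new LinkedHashMap<>();
        private String current;
        private long startedNs;

        /** Finish the previous phase (if any) and start timing the next one. */
        synchronized void enterPhase(String phase) {
            long now = System.nanoTime();
            if (current != null)
                durationsNs.merge(current, now - startedNs, Long::sum);
            current = phase;
            startedNs = now;
        }

        /** Print phase durations, longest first, so the "blocking" phase is obvious. */
        synchronized void dump() {
            durationsNs.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.printf("%-30s %d ms%n",
                    e.getKey(), TimeUnit.NANOSECONDS.toMillis(e.getValue())));
        }
    }

Calling enterPhase() at whatever the real phase boundaries are (waiting for partition release, waiting for the full map, recovery, ...) and dump() at the end of the exchange would already answer most of my question.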
Wed, Sep 18, 2019 at 13:27, Anton Vinogradov <a...@apache.org>:
>
> Alexey,
>
> >> Can you describe the complete protocol changes first
> The current goal is to find a better way.
> I had at least 5 scenarios discarded because corner cases were found (thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
> That's why I explained what we are able to improve and why I think it works.
>
> >> we need to remove this property, not add new logic that relies on it.
> Agree.
>
> >> How are you going to synchronize this?
> Thanks for providing this case; it seems it rules out the #1 + #2.2 combination, while #2.1 is still possible with some optimizations.
>
> The "zombie eating transactions" case can be solved in theory, I think.
> As I said before, we may perform the "Local switch" only when the affinity has not changed (except for the failed node going missing); other cases require a regular PME.
> In this case, the new primary is an ex-backup, and we expect that the old primary will try to update the new primary as a backup.
> The new primary will handle operations as a backup until it is notified that it is now the primary.
> Operations from the ex-primary will be discarded or remapped once the new primary is notified that it became the primary.
>
> Discarding is a big improvement,
> remapping is a huge improvement,
> but there is no 100% guarantee that the ex-primary will try to update the new primary as a backup.
>
> There are a lot of corner cases here.
> So a minimized sync seems to be the better solution.
>
> Finally, according to your and Alexey Scherbakov's comments, the better option is just to improve PME so that it waits for less, at least for now.
> It seems we have to wait for (or cancel; I vote for cancelling, any objections?) the operations related to the failed primaries, and wait for recovery to finish (which is fast).
> In case the affinity was changed and a backup-primary switch (not related to the failed primaries) is required between the owners, or even a rebalance is required, we should just ignore this and perform only a "Recovery PME".
> A regular PME should happen later (if necessary); it can even be delayed (if necessary).
>
> Sounds good?
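Just to check that I read the "discarded or remapped" part above correctly, here is how I picture it (a toy sketch, not Ignite code; senderTopVer/localTopVer are hypothetical names): the new primary compares the topology version attached to each incoming update with the version it switched on, and drops or remaps anything that was mapped by the ex-primary.

    /** Toy sketch of the "discard or remap" idea; not the real protocol code. */
    class SwitchGuard {
        /** Hypothetical: bumped on every local switch. */
        private volatile long localTopVer;

        void onLocalSwitch(long newTopVer) {
            localTopVer = newTopVer;
        }

        /** Called on the new primary (ex-backup) for each incoming update request. */
        Decision onUpdateRequest(long senderTopVer) {
            if (senderTopVer < localTopVer)
                return Decision.DISCARD_OR_REMAP; // mapped by an ex-primary
            return Decision.APPLY; // the sender already sees the same (or a newer) topology
        }

        enum Decision { APPLY, DISCARD_OR_REMAP }
    }

If that is the idea, then the hard part is exactly the race Alexey describes below: node B still maps its transactions to the old primary A, and A still believes it is the primary, so neither of them ever reaches this check with the new version.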
> On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >
> > Anton,
> >
> > I looked through the presentation and the ticket but did not find any new protocol description that you are going to implement. Can you describe the complete protocol changes first (to save time for both of us during the later review)?
> >
> > Some questions that I have in mind:
> > * It looks like for the "Local Switch" optimization you assume that a node failure happens immediately for the whole cluster. This is not true: some nodes may "see" node A's failure, while others still consider it alive. Moreover, node A may not know yet that it is about to be failed and may keep processing requests correctly. This may happen, for example, due to a long GC pause on node A. In this case, node A resumes its execution and proceeds to work as a primary (it has not received a segmentation event yet), and node B also has not received the A FAILED event yet. Node C received the event, ran the "Local switch" and assumed the primary role; node D also received the A FAILED event and switched to the new primary. Transactions from nodes B and D will be processed simultaneously on different primaries. How are you going to synchronize this?
> > * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this property, not add new logic that relies on it. There is no way a user can calculate this property or adjust it at runtime. If a user decreases this property below a safe value, we will get inconsistent update counters and cluster desync. Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large value and will push the HWM forward very quickly, much faster than during regular updates (you will have to increment it for each partition).
> >
> > Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <a...@apache.org>:
> > >
> > > Igniters,
> > >
> > > Recently we had a discussion devoted to the non-blocking PME.
> > > We agreed that the most important case is blocking on node failure, and it can be split into:
> > >
> > > 1) The latency of operations on the affected partitions will be increased by the node failure detection duration.
> > > So, some operations may be frozen for 10+ seconds on real clusters, just waiting for a response from the failed primary.
> > > In other words, some operations will be blocked even before the blocking PME starts.
> > >
> > > The good news here is that a bigger cluster decreases the percentage of blocked operations.
> > >
> > > The bad news is that these operations may block non-affected operations in
> > > - customer code (single-thread/striped pool usage)
> > > - multikey operations (tx1 locked A and waits for the failed B; the non-affected tx2 waits for A)
> > > - striped pools inside AI (when some task waits for tx.op() in a sync way and the striped thread is busy)
> > > - etc.
> > >
> > > It seems that, thanks to StopNodeFailureHandler (if configured), we already always send the node-left event before the node stops, to minimize the waiting period.
> > > So only the cases that cause a hang without a stop are a problem now.
> > >
> > > Anyway, some additional research is required here, and it would be nice if someone were willing to help.
> > >
> > > 2) Some optimizations may speed up the node-left case (eliminate the blocking of upcoming operations).
> > > A full list can be found in the presentation [1].
> > > The list contains 8 optimizations, but I propose to implement some in phase one and the rest in phase two.
> > > Assuming that a real production deployment has the Baseline enabled, we are able to gain a speed-up by implementing the following:
> > >
> > > #1 Switch on the node_fail/node_left event locally instead of starting a real PME (Local switch).
> > > Since the BLT is enabled, we are always able to switch to the new-affinity primaries (no need to preload partitions).
> > > In case we are not able to switch to the new-affinity primaries (all of them missing, or the BLT disabled), we will just start a regular PME.
> > > The new-primary calculation can be performed locally or by the coordinator (e.g. attached to the node_fail message).
> > >
> > > #2 We should not wait for the completion of any already started operations (since they are not related to the failed primary's partitions).
> > > The only problem is recovery, which may cause update-counter duplication in case of an unsynced HWM.
> > >
> > > #2.1 We may wait only for recovery completion (Micro-blocking switch).
> > > Just block all upcoming operations (only at this phase) during the recovery by incrementing the topology version.
> > > In other words, it will be some kind of PME with waiting, but it will wait for the recovery (fast) instead of waiting for current operations to finish (long).
> > >
> > > #2.2 Recovery can, theoretically, be async.
> > > We have to solve the unsynced HWM issue (to avoid concurrent usage of the same counters) to make that happen.
> > > We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT on the new primary and continue recovery in an async way.
> > > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed transactions we expect between "the first backup committed tx1" and "the last backup committed the same tx1".
> > > I propose to use it to specify the number of prepared transactions we expect between "the first backup prepared tx1" and "the last backup prepared the same tx1".
> > > Both cases look pretty similar.
> > > In this case, we are able to make the switch fully non-blocking, with async recovery.
> > > Thoughts?
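If I read #2.2 correctly, the counter side of it would look roughly like the sketch below (my reading, all names hypothetical, not the real partition counter code): on the switch the new primary jumps its HWM by the configured window, so the counters it hands out to new updates cannot collide with the counters still being assigned by the asynchronous recovery of prepared transactions.

    import java.util.concurrent.atomic.AtomicLong;

    /** Rough sketch of the #2.2 counter reservation idea. */
    class CounterReservationSketch {
        private final AtomicLong hwm = new AtomicLong();

        /** Called once when this node becomes the primary without a blocking PME. */
        void onBecomePrimary(long maxPreparedTxWindow) {
            // Reserve counter space for transactions that are still being recovered.
            hwm.addAndGet(maxPreparedTxWindow);
        }

        /** New updates get counters strictly above the reserved recovery range. */
        long nextUpdateCounter() {
            return hwm.incrementAndGet();
        }
    }

Which brings me back to Alexey's point above: the safety of this depends entirely on the window constant being large enough, and that is exactly the value nobody can calculate.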
> > > So, I'm going to implement both improvements under the "Lightweight version of partitions map exchange" issue [2] if no one minds.
> > >
> > > [1] https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
> > > [2] https://issues.apache.org/jira/browse/IGNITE-9913

--
Best regards,
Ivan Pavlukhin