Re: [DISCUSS] Data loss handling improvements

Alexei Scherbakov Thu, 07 May 2020 01:37:03 -0700

Yes, it will work this way.

чт, 7 мая 2020 г. в 10:43, Anton Vinogradov <[email protected]>:


> Seems I got the vision, thanks.
> There should be only 2 ways to reset lost partition: to gain an owner from
> resurrected first or to remove ex-owner from baseline (partition will be
> rearranged).
> And we should make a decision for every lost partition before calling the
> reset.
>
> On Wed, May 6, 2020 at 8:02 PM Alexei Scherbakov <
> [email protected]> wrote:
>
> > ср, 6 мая 2020 г. в 12:54, Anton Vinogradov <[email protected]>:
> >
> > > Alexei,
> > >
> > > 1,2,4,5 - looks good to me, no objections here.
> > >
> > > >> 3. Lost state is impossible to reset if a topology doesn't have at
> > least
> > > >> one owner for each lost partition.
> > >
> > > Do you mean that, according to your example, where
> > > >> a node2 has left, soon a node3 has left. If the node2 is returned to
> > > >> the topology first, it would have stale data for some keys.
> > > we have to have node2 at cluster to be able to reset "lost" to node2's
> > > data?
> > >
> >
> > Not sure if I understand a question, but try to answer using an example:
> > Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, partition p is
> > owned by n2 and n3.
> > 1. Topology is activated.
> > 2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
> > 3. n2 has failed.
> > 4. cache.put(p, 1) // n3 has p->1, updateCounter=2
> > 5. n3 has failed, partition loss is happened.
> > 6. n2 joins a topology, it has stale data (p->0)
> >
> > We actually have 2 issues:
> > 7. cache.put(p, 2) will success, n2 has p->2, n3 has p->0, data is
> diverged
> > and will not be adjusted by counters rebalancing if n3 is later joins a
> > topology.
> > or
> > 8. n3 joins a topology, it has actual data (p->1) but rebalancing will
> not
> > work because joining node has highest counter (it can only be a demander
> in
> > this scenario).
> >
> > In both cases rebalancing by counters will not work causing data
> divergence
> > in copies.
> >
> >
> > >
> > > >> at least one owner for each lost partition.
> > > What the reason to have owners for all lost partitions when we want to
> > > reset only some (available)?
> > >
> >
> > It's never were possible to reset only subset of lost partitions. The
> > reason is to make guarantee of resetLostPartitions method - all cache
> > operations are resumed, data is correct.
> >
> >
> > > Will it be possible to perform operations on non-lost partitions when
> the
> > > cluster has at least one lost partition?
> > >
> >
> > Yes it will be.
> >
> >
> > >
> > > On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
> > > [email protected]> wrote:
> > >
> > > > Folks,
> > > >
> > > > I've almost finished a patch bringing some improvements to the data
> > loss
> > > > handling code, and I wish to discuss proposed changes with the
> > community
> > > > before submitting.
> > > >
> > > > *The issue*
> > > >
> > > > During the grid's lifetime, it's possible to get into a situation
> when
> > > some
> > > > data nodes have failed or mistakenly stopped. If a number of stopped
> > > nodes
> > > > exceeds a certain threshold depending on configured backups, count a
> > data
> > > > loss will occur. For example, a grid having one backup (meaning at
> > least
> > > > two copies of each data partition exist at the same time) can
> tolerate
> > > only
> > > > one node loss at the time. Generally, data loss is guaranteed to
> occur
> > if
> > > > backups + 1 or more nodes have failed simultaneously using default
> > > affinity
> > > > function.
> > > >
> > > > For in-memory caches, data is lost forever. For persistent caches,
> data
> > > is
> > > > not physically lost and accessible again after failed nodes are
> > returned
> > > to
> > > > the topology.
> > > >
> > > > Possible data loss should be taken into consideration while designing
> > an
> > > > application.
> > > >
> > > >
> > > >
> > > > *Consider an example: money is transferred from one deposit to
> another,
> > > and
> > > > all nodes holding data for one of the deposits are gone.In such a
> > case, a
> > > > transaction temporary cannot be completed until a cluster is
> recovered
> > > from
> > > > the data loss state. Ignoring this can cause data inconsistency.*
> > > > It is necessary to have an API telling us if an operation is safe to
> > > > complete from the perspective of data loss.
> > > >
> > > > Such an API exists for some time [1] [2] [3]. In short, a grid can be
> > > > configured to switch caches to the partial availability mode if data
> > loss
> > > > is detected.
> > > >
> > > > Let's give two definitions according to the Javadoc for
> > > > *PartitionLossPolicy*:
> > > >
> > > > ·   *Safe* (data loss handling) *policy* - cache operations are only
> > > > available for non-lost partitions (PartitionLossPolicy != IGNORE).
> > > >
> > > > ·   *Unsafe policy* - cache operations are always possible
> > > > (PartitionLossPolicy = IGNORE). If the unsafe policy is configured,
> > lost
> > > > partitions automatically re-created on the remaining nodes if needed
> or
> > > > immediately owned if a last supplier has left during rebalancing.
> > > >
> > > > *That needs to be fixed*
> > > >
> > > > 1. The default loss policy is unsafe, even for persistent caches in
> the
> > > > current implementation. It can result in unintentional data loss and
> > > > business invariants' failure.
> > > >
> > > > 2. Node restarts in the persistent grid with detected data loss will
> > > cause
> > > > automatic resetting of LOST state after the restart, even if the safe
> > > > policy is configured. It can result in data loss or partition desync
> if
> > > not
> > > > all nodes are returned to the topology or returned in the wrong
> order.
> > > >
> > > >
> > > > *An example: a grid has three nodes, one backup. The grid is under
> > load.
> > > > First, a node2 has left, soon a node3 has left. If the node2 is
> > returned
> > > to
> > > > the topology first, it would have stale data for some keys. Most
> recent
> > > > data are on node3, which is not in the topology yet. Because a lost
> > state
> > > > was reset, all caches are fully available, and most probably will
> > become
> > > > inconsistent even in safe mode.*
> > > > 3. Configured loss policy doesn't provide guarantees described in the
> > > > Javadoc depending on the cluster configuration[4]. In particular,
> > unsafe
> > > > policy (IGNORE) cannot be guaranteed if a baseline is fixed (not
> > > > automatically readjusted on node left), because partitions are not
> > > > automatically get reassigned on topology change, and no nodes are
> > > existing
> > > > to fulfill a read/write request. Same for READ_ONLY_ALL and
> > > READ_WRITE_ALL.
> > > >
> > > > 4. Calling resetLostPartitions doesn't provide a guarantee for full
> > cache
> > > > operations availability if a topology doesn't have at least one owner
> > for
> > > > each lost partition.
> > > >
> > > > The ultimate goal of the patch is to fix API inconsistencies and fix
> > the
> > > > most crucial bugs related to data loss handling.
> > > >
> > > > *The planned changes are:*
> > > >
> > > > 1. The safe policy is used by default, except for in-memory grids
> with
> > > > enabled baseline auto-adjust [5] with zero timeout [6]. In the latter
> > > case,
> > > > the unsafe policy is used by default. It protects from unintentional
> > data
> > > > loss.
> > > >
> > > > 2. Lost state is never reset in the case of grid nodes restart
> (despite
> > > > full restart). It makes real data loss impossible in persistent grids
> > if
> > > > following the recovery instruction.
> > > >
> > > > 3. Lost state is impossible to reset if a topology doesn't have at
> > least
> > > > one owner for each lost partition. If nodes are physically dead, they
> > > > should be removed from a baseline first before calling
> > > resetLostPartitions.
> > > >
> > > > 4. READ_WRITE_ALL, READ_ONLY_ALL is a subject for deprecation because
> > > their
> > > > guarantees are impossible to fulfill, not on the full baseline.
> > > >
> > > > 5. Any operation failed due to data loss contains
> > > > CacheInvalidStateException as a root cause.
> > > >
> > > > In addition to code fixes, I plan to write a tutorial for safe data
> > loss
> > > > recovery in the persistent mode in the Ignite wiki.
> > > >
> > > > Any comments for the proposed changes are welcome.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
> > > > partLossPlc)
> > > > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > > > [3] org.apache.ignite.IgniteCache#lostPartitions
> > > > [4]  https://issues.apache.org/jira/browse/IGNITE-10041
> > > > [5]
> org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > > > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
> > > >
> > >
> >
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
> >
>


-- 

Best regards,
Alexei Scherbakov

Re: [DISCUSS] Data loss handling improvements

Reply via email to