Igor,

Yes, I tried to draw different configurations, and it really seems to work,
despite being very hard to prove due to non-intuitive HB edges. So let me
try to spell out the algorithm once again to make sure that we are on the
same page here.

1) There are two nodes - primary (P) and backup (B)
2) There are three types of events: small transactions which possibly
increment the update counter (ucX), one long active transaction which is
split into multiple operations (opX), and checkpoints (cpX)
3) Every node always has a current update counter. When a transaction
commits, it may or may not shift this counter further, depending on whether
there are holes behind it. But we have a strict rule that the counter always
grows. A higher counter synchronizes with smaller ones. Possible cases:
----uc1----uc2----uc3----
----uc1--------uc3------- // uc2 missing due to reorder, but it is ok
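
To make the monotonicity rule concrete, here is a minimal sketch (purely
illustrative, not the real Ignite counter code, and how holes are tracked is
out of scope here) of a counter that only moves forward, so observing a
higher value happens-after every smaller one:

import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: a per-partition update counter that never decreases. */
class MonotonicUpdateCounter {
    private final AtomicLong cntr = new AtomicLong();

    /** A committing transaction proposes a new value; the counter only moves
     *  forward, so a reordered (smaller) value such as uc2 arriving after uc3
     *  is simply absorbed. */
    void onCommit(long proposed) {
        cntr.accumulateAndGet(proposed, Math::max);
    }

    long current() {
        return cntr.get();
    }
}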

4) Operations within a single transaction are always applied sequentially,
and hence also have an HB edge:
----op1----op2----op3----

5) When a transaction operation happens, we save in memory the current
update counter available at this moment. I.e. we have a map from transaction
ID to the update counter which was relevant by the time the last *completed*
operation *started*. This is a very important point - we remember the
counter when the operation starts, but update the map only when it finishes.
This is needed for the situation when the update counter is bumped in the
middle of a long operation.
----uc1----op1----op2----uc2----uc3----op3----
            |      |                    |
           uc1    uc1                  uc3

state: tx1 -> op3 -> uc3
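
A minimal sketch of this bookkeeping (all names are invented for
illustration): the counter observed at operation start becomes visible in
the map only when that operation finishes:

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: per-tx tracking of the update counter that was current
 *  when the last *completed* operation *started*. */
class TxCounterTracker {
    /** txId -> counter observed when the currently running operation started. */
    private final Map<Long, Long> inFlight = new HashMap<>();

    /** txId -> counter of the last *completed* operation; checkpoints read this. */
    private final Map<Long, Long> completed = new HashMap<>();

    /** An operation of tx starts: remember the counter, do not publish it yet.
     *  Operations inside one tx are sequential (p.4), so one slot per tx is enough. */
    synchronized void onOperationStart(long txId, long currentUpdateCounter) {
        inFlight.put(txId, currentUpdateCounter);
    }

    /** The operation finishes: only now the remembered value becomes visible.
     *  This covers the counter being bumped in the middle of a long operation. */
    synchronized void onOperationFinish(long txId) {
        Long cntr = inFlight.remove(txId);
        if (cntr != null)
            completed.put(txId, cntr);
    }

    /** The transaction commits or rolls back: stop tracking it. */
    synchronized void onTxFinish(long txId) {
        inFlight.remove(txId);
        completed.remove(txId);
    }

    /** Snapshot used by the checkpointer (see p.6). */
    synchronized Map<Long, Long> activeTxCounters() {
        return new HashMap<>(completed);
    }
}

For the diagram above: op3 starts after uc3, so onOperationStart(tx1, 3) is
called, and once op3 finishes the map holds tx1 -> uc3, exactly the state
shown.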

6) Whenever a checkpoint occurs, we save two counters with it: "current" and
"backpointer". The latter is the smallest update counter associated with
active transactions. If there are no active transactions, the current update
counter is used.

Example 1: no active transactions.
----uc1----cp1----
     ^      |
     --------

state: cp1 [current=uc1, backpointer=uc1]

Example 2: one active transaction:
                                 ---------------
                                 |             |
----uc1----op1----uc2----op2----op3----uc3----cp1----
                   ^             |
                   --------------

state: tx1 -> op3 -> uc2
       cp1 [current=uc3, backpointer=uc2]
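
A sketch of how this pair could be computed at checkpoint time (names are
invented), using the per-tx counters from p.5:

import java.util.Collection;

/** Illustrative only: the two counters stored with every checkpoint. */
class CheckpointCounters {
    final long current;      // update counter at the moment the checkpoint starts
    final long backpointer;  // smallest counter of an active tx, or "current" if none

    CheckpointCounters(long current, long backpointer) {
        this.current = current;
        this.backpointer = backpointer;
    }

    /** Builds the pair from the current counter and the active tx counters of p.5. */
    static CheckpointCounters create(long currentUpdateCounter,
        Collection<Long> activeTxCounters) {
        long back = activeTxCounters.stream()
            .mapToLong(Long::longValue)
            .min()
            .orElse(currentUpdateCounter); // no active txs -> backpointer = current

        return new CheckpointCounters(currentUpdateCounter, back);
    }
}

For example 2 above, create(uc3, {uc2}) yields [current=uc3,
backpointer=uc2]; for example 1, with no active transactions, create(uc1, {})
yields [current=uc1, backpointer=uc1].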

7) Historical rebalance:
7.1) Demander finds the latest checkpoint, gets its backpointer and sends it
to the supplier.
7.2) Supplier finds the earliest checkpoint where [supplier(current) <=
demander(backpointer)].
7.3) Supplier reads that checkpoint's backpointer and finds the associated
WAL record. This is where we start.

So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current <=
demander_uc_backpointer)] <- demander[uc_backpointer <- cp(last)]
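
A sketch of the lookups in 7.1-7.3 (names are invented; I read 7.2 together
with p.3 of the "why it works" list below, i.e. the supplier takes the
closest such checkpoint in the past, so the scan below goes from newest to
oldest):

import java.util.List;

/** Illustrative only: checkpoint metadata from p.6 plus the WAL position its
 *  backpointer resolves to. */
class CheckpointMeta {
    final long current;           // update counter when the checkpoint started
    final long backpointer;       // smallest counter of a then-active tx, or current
    final long backpointerWalPtr; // WAL record associated with the backpointer

    CheckpointMeta(long current, long backpointer, long backpointerWalPtr) {
        this.current = current;
        this.backpointer = backpointer;
        this.backpointerWalPtr = backpointerWalPtr;
    }

    /** 7.1: the demander takes the backpointer of its latest checkpoint
     *  (checkpoint lists are ordered oldest to newest here). */
    static long demandCounter(List<CheckpointMeta> demanderCheckpoints) {
        return demanderCheckpoints.get(demanderCheckpoints.size() - 1).backpointer;
    }

    /** 7.2 + 7.3: the supplier walks its checkpoints from newest to oldest,
     *  picks the closest one whose current counter is still <= the demander's
     *  backpointer, and returns the WAL position of that checkpoint's
     *  backpointer. Returns -1 if history is too short (full rebalance). */
    static long walStartPointer(List<CheckpointMeta> supplierCheckpoints,
        long demanderBackpointer) {
        for (int i = supplierCheckpoints.size() - 1; i >= 0; i--) {
            CheckpointMeta cp = supplierCheckpoints.get(i);

            if (cp.current <= demanderBackpointer)
                return cp.backpointerWalPtr;
        }

        return -1;
    }
}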

Now the most important part - why it works :-)
1) Transaction operations are sequential, so at the time of crash nodes are
*at most one operation ahead* of each other
2) Demander goes to the past and finds the update counter which was current
at the time of the last completed TX operation
3) Supplier goes to the closest checkpoint in the past where this update
counter either doesn't exist or has just appeared
4) The transaction cannot be committed on the supplier at this checkpoint,
as that would violate the UC happens-before rule
5) The transaction may have not started yet on the supplier at this point.
If so, more recent WAL records will contain *ALL* updates of the transaction
6) The transaction may exist on the supplier at this checkpoint. Thanks to
p.1 we need to go back by at most one operation, and the jump back through
the supplier's checkpoint backpointer is guaranteed to cover this.

Igor, do we have the same understanding here?

Vladimir.

On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor <gvvinbl...@gmail.com>
wrote:

> Ivan,
>
> different transactions may be applied in a different order on backup
> nodes. That's why we need an active tx set and some sorting by their
> update times. The idea is to identify a point in time starting from which
> we may lose some updates.
> This point:
>    1) is the last update on the timeline acknowledged by all backups
> (including a possible future demander);
>    2) has a specific update counter (aka back-counter) which we are going
> to start iteration from.
>
> After some additional thinking, I've identified a rule:
>
> There are two fences:
>   1) update counter (UC) - this means that all updates with a smaller UC
> than the applied one were applied on a node having this UC.
>   2) update in the scope of a TX - all updates are applied one by one
> sequentially, which means that the fact of an update guarantees that the
> previous update (statement) was finished on all TX participants.
>
> Combining them, we can say the following:
>
> All updates that were acknowledged by the time the last update of the tx
> which bumped the UC was applied are guaranteed to be present on a node
> having such a UC.
>
> We can use this rule to find the iterator start pointer.
>
> On Wed, Nov 28, 2018 at 8:26 PM Павлухин Иван <vololo...@gmail.com> wrote:
>
> > Guys,
> >
> > Another idea. We can introduce an additional update counter which is
> > incremented by MVCC transactions right after executing an operation (like
> > it is done for classic transactions). And we can use that counter for
> > searching for the needed WAL records. Would that do the trick?
> >
> > P.S. Mentally I am trying to separate facilities providing
> > transactions and durability. And it seems to me that those facilities
> > are in different dimensions.
> > On Wed, Nov 28, 2018 at 4:26 PM Павлухин Иван <vololo...@gmail.com> wrote:
> > >
> > > Sorry, if it was stated that a SINGLE transaction's updates are applied
> > > in the same order on all replicas then I have no questions so far. I
> > > thought about reordering of updates coming from different transactions.
> > > > I have not got why we can assume that reordering is not possible.
> > > > What have I missed?
> > > On Wed, Nov 28, 2018 at 1:26 PM Павлухин Иван <vololo...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Regarding Vladimir's new idea.
> > > > > We assume that transaction can be represented as a set of
> > > > > independent operations, which are applied in the same order on both
> > > > > primary and backup nodes.
> > > > I have not got why we can assume that reordering is not possible.
> > > > What have I missed?
> > > > On Tue, Nov 27, 2018 at 2:42 PM Seliverstov Igor <gvvinbl...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Vladimir,
> > > > >
> > > > > I think I got your point,
> > > > >
> > > > > It should work if we do the following:
> > > > > introduce two structures: an active list (txs) and a candidate list
> > > > > (updCntr -> txn pairs)
> > > > >
> > > > > Track active txs, mapping them to the actual update counter at update
> > > > > time.
> > > > > On each next update, put the update counter associated with the previous
> > > > > update into the candidate list, possibly overwriting an existing value
> > > > > (checking txn).
> > > > > On tx finish, remove the tx from the active list only if the appropriate
> > > > > update counter (associated with the finished tx) is applied.
> > > > > On update counter update, set the minimal update counter from the
> > > > > candidate list as a back-counter, clear the candidate list and remove
> > > > > the associated tx from the active list if present.
> > > > > Use the back-counter instead of the actual update counter in the demand
> > > > > message.
> > > > >
> > > > > On Tue, Nov 27, 2018 at 12:56 PM Seliverstov Igor <gvvinbl...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Ivan,
> > > > > >
> > > > > > 1) The list is saved on each checkpoint, wholly (all transactions in
> > > > > > active state at checkpoint begin).
> > > > > > We need the whole list to get the oldest transaction because after
> > > > > > the previous oldest tx finishes, we need to get the following one.
> > > > > >
> > > > > > 2) I guess there is a description of how persistent storage works and
> > > > > > how it is restored [1]
> > > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > the whole list of what we are going to store on checkpoint (updated):
> > > > > > 1) Partition counter low watermark (LWM)
> > > > > > 2) WAL pointer of the earliest active transaction write to the
> > > > > > partition at the time the checkpoint started
> > > > > > 3) List of prepared txs with acquired partition counters (which were
> > > > > > acquired but not applied yet)
> > > > > >
> > > > > > This way we don't need any additional info in the demand message. The
> > > > > > start point can be easily determined using the stored WAL "back-pointer".
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> > > > > >
> > > > > >
> > > > > > On Tue, Nov 27, 2018 at 11:19 AM Vladimir Ozerov <voze...@gridgain.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Igor,
> > > > > >>
> > > > > >> Could you please elaborate - what is the whole set of information we
> > > > > >> are going to save at checkpoint time? From what I understand this
> > > > > >> should be:
> > > > > >> 1) List of active transactions with WAL pointers of their first writes
> > > > > >> 2) List of prepared transactions with their update counters
> > > > > >> 3) Partition counter low watermark (LWM) - the smallest partition
> > > > > >> counter before which there are no prepared transactions.
> > > > > >>
> > > > > >> And then we send to the supplier node a message: "Give me all updates
> > > > > >> starting from that LWM plus data for those transactions which were
> > > > > >> active when I failed".
> > > > > >>
> > > > > >> Am I right?
> > > > > >>
> > > > > >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor <gvvinbl...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Igniters,
> > > > > >> >
> > > > > >> > Currently I’m working on possible approaches to implementing
> > > > > >> > historical rebalance (delta rebalance using a WAL iterator) over
> > > > > >> > MVCC caches.
> > > > > >> >
> > > > > >> > The main difficulty is that MVCC writes changes during the tx active
> > > > > >> > phase, while the partition update version, aka update counter, is
> > > > > >> > applied on tx finish. This means we cannot start iteration over the
> > > > > >> > WAL right from the pointer where the update counter was updated, but
> > > > > >> > should include the updates made by the transaction that updated the
> > > > > >> > counter.
> > > > > >> >
> > > > > >> > These updates may be much earlier than the point where the update
> > > > > >> > counter was updated, so we have to be able to identify the point
> > > > > >> > where the first update happened.
> > > > > >> >
> > > > > >> > The proposed approach includes:
> > > > > >> >
> > > > > >> > 1) preserve a list of active txs, sorted by the time of their
> > > > > >> > first update (using the WAL ptr of the first WAL record in the tx)
> > > > > >> >
> > > > > >> > 2) persist this list on each checkpoint (together with TxLog for
> > > > > >> > example)
> > > > > >> >
> > > > > >> > 3) send the whole active tx list (transactions which were in active
> > > > > >> > state at the time the node crashed, an empty list in case of a
> > > > > >> > graceful node stop) as a part of the partition demand message.
> > > > > >> >
> > > > > >> > 4) find a checkpoint where the earliest tx exists in the persisted
> > > > > >> > txs and use the saved WAL ptr as a start point, or apply the current
> > > > > >> > approach in case the active tx list (sent on the previous step) is
> > > > > >> > empty
> > > > > >> >
> > > > > >> > 5) start iteration.
> > > > > >> >
> > > > > >> > Your thoughts?
> > > > > >> >
> > > > > >> > Regards,
> > > > > >> > Igor
> > > > > >>
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ivan Pavlukhin
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >
>
