Re: Lightweight version of partitions map exchange

Nikita Amelchev Fri, 24 May 2019 03:31:39 -0700

Hello, Igniters!

I am working on the implementation of lightweight PME for the case
when a baseline topology (BLT) node leaves. [1]


One open question: should lightweight PME be allowed while the cluster
has MOVING partitions?

Problems that may happen if it is allowed:
 - Nodes can select different primary nodes from the current OWNING backups.
 - Some nodes can mark a partition as LOST while others mark it as OWNING.

We can take the partition states from the node2part map. The root
cause of these problems is that when rebalancing ends (the node gets
the last message), the node updates its local partition state to OWNING
and only schedules a partitions resend. Until that resend is processed,
other nodes may still see the old state, which may lead to different
affinity recalculations on different nodes.
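
To make the divergence concrete, here is a minimal sketch. It does not
use Ignite classes: the State enum, selectPrimary and
partitionStateAfterLeave are hypothetical names, and "first OWNING node
in affinity order becomes primary" is an assumed simplification of the
real selection logic.

```java
import java.util.List;
import java.util.Map;

public class PrimarySelectionSketch {
    // Simplified partition states (hypothetical, not Ignite's own enum).
    enum State { OWNING, MOVING, LOST }

    // Returns the first node in affinity order that this view sees as OWNING,
    // or null if the view contains no OWNING copy.
    static String selectPrimary(List<String> affinityOrder, Map<String, State> view) {
        for (String node : affinityOrder) {
            if (view.get(node) == State.OWNING)
                return node;
        }
        return null;
    }

    // A partition with no OWNING copy in the local view is treated as LOST.
    static State partitionStateAfterLeave(List<String> affinityOrder, Map<String, State> view) {
        return selectPrimary(affinityOrder, view) == null ? State.LOST : State.OWNING;
    }

    public static void main(String[] args) {
        // Nodes that remain for one partition after a node leaves, in affinity order.
        List<String> survivors = List.of("B", "C");

        // Stale view: the update about B finishing rebalancing has not arrived yet.
        Map<String, State> staleView = Map.of("B", State.MOVING, "C", State.OWNING);
        // Fresh view: B is already seen as OWNING.
        Map<String, State> freshView = Map.of("B", State.OWNING, "C", State.OWNING);

        System.out.println(selectPrimary(survivors, staleView)); // C
        System.out.println(selectPrimary(survivors, freshView)); // B

        // If B holds the only remaining copy, the same lag splits LOST and OWNING.
        List<String> onlyB = List.of("B");
        System.out.println(partitionStateAfterLeave(onlyB, Map.of("B", State.MOVING))); // LOST
        System.out.println(partitionStateAfterLeave(onlyB, Map.of("B", State.OWNING))); // OWNING
    }
}
```

Both views are valid snapshots of node2part taken at slightly different
moments, which is why a purely local recalculation is not safe here.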

I see two solutions:

1. Nodes will store the "moving-owning" transitions of partition states
until the rebalancing ends. Each node will locally recalculate the
affinity on the node left event.
2. The coordinator will calculate the affinity and send the "full map" to
the nodes. In this case, nodes should still wait for the topology change
event (to get the correct topology in discovery).

If we disallow lightweight PME while the cluster has MOVING partitions,
these problems do not occur and it works fine.
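
A minimal sketch of such a guard, assuming node2part can be viewed as a
plain map of node id -> (partition id -> state). The class, the enum
and lightweightPmeAllowed are hypothetical names, not Ignite API:

```java
import java.util.Map;

public class LightweightPmeGuard {
    // Simplified partition states (hypothetical, not Ignite's own enum).
    enum State { OWNING, MOVING, LOST }

    // Allows lightweight PME only if no node reports a MOVING partition.
    static boolean lightweightPmeAllowed(Map<String, Map<Integer, State>> node2part) {
        for (Map<Integer, State> parts : node2part.values()) {
            if (parts.containsValue(State.MOVING))
                return false; // fall back to the regular distributed PME
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, State>> stable = Map.of(
            "B", Map.of(0, State.OWNING, 1, State.OWNING),
            "C", Map.of(0, State.OWNING, 1, State.OWNING));

        Map<String, Map<Integer, State>> rebalancing = Map.of(
            "B", Map.of(0, State.OWNING, 1, State.MOVING),
            "C", Map.of(0, State.OWNING, 1, State.OWNING));

        System.out.println(lightweightPmeAllowed(stable));      // true
        System.out.println(lightweightPmeAllowed(rebalancing)); // false
    }
}
```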

Any thoughts?

1. https://issues.apache.org/jira/browse/IGNITE-9913

Fri, 29 Mar 2019 at 15:00, Nikita Amelchev <nsamelc...@gmail.com>:
>
> Pavel,
> I have provided the MTCGA bot status in the Jira issue comments. [1]
>
> Eduard,
> Yes, in the current implementation a distributed PME will be performed
> if in-memory caches are configured.
>
> 1. https://issues.apache.org/jira/browse/IGNITE-9913
>
> Fri, 29 Mar 2019 at 14:49, Eduard Shangareev <eduard.shangar...@gmail.com>:
> >
> > Nikita,
> >
> > It sounds cool. But I didn't get the part about in-memory caches. The
> > baseline is not used for their affinity calculation.
> > So, would this improvement be switched off only for them, or completely
> > (when such caches are present)?
> >
> > On Thu, Mar 28, 2019 at 3:14 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> >
> > > Hi Nikita,
> > >
> > > Thank you for your work. This is a great improvement. I'll take a look
> > > at it in the next couple of days. Could you please run TC and provide
> > > the MTCGA bot status for this change?
> > >
> > > Thu, 28 Mar 2019 at 14:29, Nikita Amelchev <nsamelc...@gmail.com>:
> > >
> > > > Hello, Igniters!
> > > >
> > > > I have implemented a lightweight version of partitions map exchange
> > > > for the case when a baseline node leaves the topology. [1]
> > > >
> > > > If partitions are assigned according to the baseline topology and a
> > > > server node leaves, there is no actual need to perform a distributed
> > > > PME. Every cluster node will recalculate the new affinity assignments
> > > > and partition states locally. There is no need to wait for partitions
> > > > to be released, so PME will be started immediately.
> > > >
> > > > I have benchmarked the duration of PME under yardstick load. PME
> > > > duration decreased by up to 10 times and the maximum latency of
> > > > transactions decreased by up to 4-5 times. See details in the Jira
> > > > issue comments. [1]
> > > >
> > > > Could a PME expert take a look at my changes? [2]
> > > >
> > > > 1. https://issues.apache.org/jira/browse/IGNITE-9913
> > > > 2. https://reviews.ignite.apache.org/ignite/review/IGNT-CR-1027
> > > >
> > > > --
> > > > Best wishes,
> > > > Amelchev Nikita
> > > >
> > >
>
>
>
> --
> Best wishes,
> Amelchev Nikita



-- 
Best wishes,
Amelchev Nikita
