Denis,

I like the idea that defragmentation is just an additional step on a node
(re)start, like the PDS recovery we perform now.
We may just use a special key to specify that a node should defragment its
persistence on (re)start.
Defragmentation can then be part of a Rolling Upgrade :)
Restarting nodes one-by-one does not seem to be a problem; it will "eat"
only one backup guarantee.
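
A minimal sketch of such a startup hook, assuming a hypothetical
IGNITE_DEFRAGMENT_ON_START property (this is a made-up name, not an
existing Ignite flag):

    public class DefragOnStart {
        public static void main(String[] args) {
            // Hypothetical switch: java -DIGNITE_DEFRAGMENT_ON_START=true ...
            if (Boolean.getBoolean("IGNITE_DEFRAGMENT_ON_START"))
                defragmentPersistence(); // compact files before joining topology

            // ...continue with the usual PDS recovery and cluster join.
        }

        private static void defragmentPersistence() {
            // Rewrite each partition file into a compact copy and swap it in.
        }
    }

A rolling upgrade script would then simply pass the property to every node
it restarts.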

On Mon, Oct 7, 2019 at 8:28 PM Denis Magda <dma...@apache.org> wrote:

> Alex, thanks for the summary and proposal. Anton, Ivan and others who took
> part in this discussion, what're your thoughts? I see this
> rolling-upgrades-based approach as a reasonable solution. Even though a
> node shutdown is expected, the procedure doesn't lead to a cluster outage,
> meaning it can be used in 24x7 production environments.
>
> -
> Denis
>
>
> On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk <
> alexey.goncha...@gmail.com>
> wrote:
>
> > Created a ticket for the first stage of this improvement. It can be the
> > first step towards the online mode suggested by Sergey and Anton.
> > https://issues.apache.org/jira/browse/IGNITE-12263
> >
> > On Fri, Oct 4, 2019 at 19:38, Alexey Goncharuk <alexey.goncha...@gmail.com
> >:
> >
> > > Maxim,
> > >
> > > Having a cluster-wide lock for a cache does not improve the
> > > availability of the solution. A user cannot defragment a cache while
> > > the cache is involved in a mission-critical operation, so holding a
> > > lock on such a cache is equivalent to a whole-cluster shutdown.
> > >
> > > We should choose between a single offline node and a more complex,
> > > fully online solution.
> > >
> > > On Fri, Oct 4, 2019 at 11:55, Maxim Muzafarov <mmu...@apache.org>:
> > >
> > >> Igniters,
> > >>
> > >> This thread seems to be endless, but what if we introduce some kind
> > >> of distributed cache group write lock (exclusive for some of the
> > >> internal Ignite processes)? I think it would help to solve a batch of
> > >> problems (see the sketch after the reference below), like:
> > >>
> > >> 1. defragmentation of all cache group partitions on the local node
> > >> without concurrent updates.
> > >> 2. improved data loading with the data streamer isolation mode [1]. It
> > >> seems we should not allow concurrent updates to a cache while on the
> > >> `fast data load` step.
> > >> 3. recovery from a snapshot without cache stop/start actions
> > >>
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-11793
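> > >>
> > >> A rough illustration of the caller's side of such a lock, using the
> > >> public IgniteLock just to show the shape (a real cache-group write
> > >> lock would have to block regular cache updates internally, which
> > >> IgniteLock alone does not do; the lock name scheme below is made up):
> > >>
> > >>     import org.apache.ignite.Ignite;
> > >>     import org.apache.ignite.IgniteLock;
> > >>     import org.apache.ignite.Ignition;
> > >>
> > >>     public class GroupLockSketch {
> > >>         public static void main(String[] args) {
> > >>             Ignite ignite = Ignition.start();
> > >>
> > >>             // Failover-safe, fair, created on demand.
> > >>             IgniteLock lock = ignite.reentrantLock(
> > >>                 "grp-write-lock-myGroup", true, true, true);
> > >>
> > >>             lock.lock();
> > >>             try {
> > >>                 // Defragmentation / fast data load / snapshot restore
> > >>                 // runs here while cooperating processes are blocked
> > >>                 // on the same lock.
> > >>             }
> > >>             finally {
> > >>                 lock.unlock();
> > >>             }
> > >>         }
> > >>     }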
> > >>
> > >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skoz...@gridgain.com>
> > >> wrote:
> > >> >
> > >> > Hi
> > >> >
> > >> > I'm not sure that taking a node offline is the best way to do that.
> > >> > Cons:
> > >> >  - different caches may need defragmentation to a different degree,
> > >> > but we force the whole node to stop
> > >> >  - taking a node offline is a maintenance operation that will
> > >> > require +1 backup to reduce the risk of data loss
> > >> >  - baseline auto adjustment?
> > >> >  - impact on index rebuild?
> > >> >  - cache configuration changes (or cache destroy) while the node is
> > >> > offline
> > >> >
> > >> > What about other ways without a node stop? E.g., take a cache group
> > >> > offline on a node? Add a *defrag <cache_group>* command to
> > >> > control.sh to force a rebalance internally on the node, with an
> > >> > expected performance impact.
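> > >> >
> > >> > Something like this, purely illustrative syntax:
> > >> >
> > >> >     control.sh --defrag my_cache_group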
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <a...@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Alexey,
> > >> > > As for me, it does not matter whether it will be an IEP, an
> > >> > > umbrella ticket, or a single issue.
> > >> > > The most important thing is the Assignee :)
> > >> > >
> > >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <
> > >> > > alexey.goncha...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Anton, do you think we should file a single ticket for this or
> > >> > > > should we go with an IEP? As of now, the change does not look
> > >> > > > big enough for an IEP to me.
> > >> > > >
> > >> > > > On Thu, Oct 3, 2019 at 11:18, Anton Vinogradov <a...@apache.org>:
> > >> > > >
> > >> > > > > Alexey,
> > >> > > > >
> > >> > > > > Sounds good to me.
> > >> > > > >
> > >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <
> > >> > > > > alexey.goncha...@gmail.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Anton,
> > >> > > > > >
> > >> > > > > > Switching a partition to and from the SHRINKING state will
> > >> > > > > > require intricate synchronization in order to properly
> > >> > > > > > determine the start position for historical rebalance
> > >> > > > > > without PME.
> > >> > > > > >
> > >> > > > > > I would still go with an offline-node approach, but instead
> > >> > > > > > of cleaning the persistence, we can do an effective
> > >> > > > > > defragmentation while the node is offline, because we are
> > >> > > > > > sure there is no concurrent load. After the defragmentation
> > >> > > > > > completes, we bring the node back into the cluster and
> > >> > > > > > historical rebalance kicks in automatically. It will still
> > >> > > > > > require manual node restarts, but since the data is not
> > >> > > > > > removed, there are no additional risks. Also, this will be
> > >> > > > > > an excellent solution for those who can afford downtime and
> > >> > > > > > can execute the defragment command on all nodes in the
> > >> > > > > > cluster simultaneously - that will be the fastest way
> > >> > > > > > possible.
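> > >> > > > > >
> > >> > > > > > A simplified, self-contained sketch of the offline
> > >> > > > > > compaction step (a toy record format with a tombstone flag,
> > >> > > > > > not the real page store):
> > >> > > > > >
> > >> > > > > >     import java.io.*;
> > >> > > > > >     import java.nio.file.*;
> > >> > > > > >
> > >> > > > > >     public class OfflineDefrag {
> > >> > > > > >         /** Copies live records to a new file, then swaps it in. */
> > >> > > > > >         static void defragment(Path partFile) throws IOException {
> > >> > > > > >             Path compact = partFile.resolveSibling(
> > >> > > > > >                 partFile.getFileName() + ".compact");
> > >> > > > > >             try (DataInputStream in = new DataInputStream(
> > >> > > > > >                      new BufferedInputStream(Files.newInputStream(partFile)));
> > >> > > > > >                  DataOutputStream out = new DataOutputStream(
> > >> > > > > >                      new BufferedOutputStream(Files.newOutputStream(compact)))) {
> > >> > > > > >                 while (in.available() > 0) {
> > >> > > > > >                     boolean live = in.readBoolean(); // tombstone flag
> > >> > > > > >                     int len = in.readInt();
> > >> > > > > >                     byte[] val = new byte[len];
> > >> > > > > >                     in.readFully(val);
> > >> > > > > >                     if (live) { // deleted records are not copied over
> > >> > > > > >                         out.writeBoolean(true);
> > >> > > > > >                         out.writeInt(len);
> > >> > > > > >                         out.write(val);
> > >> > > > > >                     }
> > >> > > > > >                 }
> > >> > > > > >             }
> > >> > > > > >             Files.move(compact, partFile,
> > >> > > > > >                 StandardCopyOption.REPLACE_EXISTING);
> > >> > > > > >         }
> > >> > > > > >     }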
> > >> > > > > >
> > >> > > > > > --AG
> > >> > > > > >
> > >> > > > > > On Mon, Sep 30, 2019 at 09:29, Anton Vinogradov <
> > >> > > > > > a...@apache.org>:
> > >> > > > > >
> > >> > > > > > > Alexei,
> > >> > > > > > > >> stopping fragmented node and removing partition data,
> > >> > > > > > > >> then starting it again
> > >> > > > > > >
> > >> > > > > > > That's exactly what we're doing to solve the fragmentation
> > >> > > > > > > issue.
> > >> > > > > > > The problem here is that we have to perform N/B
> > >> > > > > > > restart-rebalance operations (N - cluster size, B - backups
> > >> > > > > > > count), and it takes a lot of time, with a risk of losing
> > >> > > > > > > data.
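> > >> > > > > > > For illustration: with N = 30 nodes and B = 2 backups that
> > >> > > > > > > is 30 / 2 = 15 sequential restart-rebalance rounds, and
> > >> > > > > > > each round has to wait for a full rebalance to finish.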
> > >> > > > > > >
> > >> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <
> > >> > > > > > > alexey.scherbak...@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Probably this should be allowed via a public API;
> > >> > > > > > > > actually, this is the same as manual rebalancing.
> > >> > > > > > > >
> > >> > > > > > > > On Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <
> > >> > > > > > > > alexey.scherbak...@gmail.com>:
> > >> > > > > > > >
> > >> > > > > > > > > The poor man's solution for the problem would be
> > >> > > > > > > > > stopping the fragmented node and removing its partition
> > >> > > > > > > > > data, then starting it again, allowing a full state
> > >> > > > > > > > > transfer without the deletes.
> > >> > > > > > > > > Rinse and repeat for all owners.
> > >> > > > > > > > >
> > >> > > > > > > > > Anton Vinogradov, would this work for you as a
> > >> > > > > > > > > workaround?
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <
> > >> > > > > > > > > a...@apache.org>:
> > >> > > > > > > > >
> > >> > > > > > > > >> Alexey,
> > >> > > > > > > > >>
> > >> > > > > > > > >> Let's combine your and Ivan's proposals.
> > >> > > > > > > > >>
> > >> > > > > > > > >> >> vacuum command, which acquires exclusive table
> > >> > > > > > > > >> >> lock, so no concurrent activities on the table are
> > >> > > > > > > > >> >> possible.
> > >> > > > > > > > >> and
> > >> > > > > > > > >> >> Could the problem be solved by stopping a node
> > >> > > > > > > > >> >> which needs to be defragmented, clearing
> > >> > > > > > > > >> >> persistence files and restarting the node?
> > >> > > > > > > > >> >> After rebalancing the node will receive all data
> > >> > > > > > > > >> >> back without fragmentation.
> > >> > > > > > > > >>
> > >> > > > > > > > >> How about having a special partition state,
> > >> > > > > > > > >> SHRINKING?
> > >> > > > > > > > >> This state would mean that the partition is
> > >> > > > > > > > >> unavailable for reads and updates, but it keeps its
> > >> > > > > > > > >> update counters and is not marked as lost, renting,
> > >> > > > > > > > >> or evicted.
> > >> > > > > > > > >> In this state we are able to iterate over the
> > >> > > > > > > > >> partition and apply its entries to another file in a
> > >> > > > > > > > >> compact way.
> > >> > > > > > > > >> Indices should be updated during the copy-on-shrink
> > >> > > > > > > > >> procedure or at shrink completion.
> > >> > > > > > > > >> Once the shrunk file is ready, we replace the
> > >> > > > > > > > >> original partition file with it and mark the
> > >> > > > > > > > >> partition as MOVING, which will start the historical
> > >> > > > > > > > >> rebalance.
> > >> > > > > > > > >> Shrinking should be performed during low-activity
> > >> > > > > > > > >> periods, but even if we find that activity was high
> > >> > > > > > > > >> and historical rebalance is not suitable, we may just
> > >> > > > > > > > >> remove the file and use regular rebalance to restore
> > >> > > > > > > > >> the partition (this will also result in a shrink).
> > >> > > > > > > > >>
> > >> > > > > > > > >> BTW, it seems we are able to implement partition
> > >> > > > > > > > >> shrink in a cheap way.
> > >> > > > > > > > >> We may just reuse the rebalancing code to apply the
> > >> > > > > > > > >> fat partition's entries to the new file.
> > >> > > > > > > > >> So, 3 stages here: local rebalance, index update, and
> > >> > > > > > > > >> global historical rebalance.
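> > >> > > > > > > > >>
> > >> > > > > > > > >> A tiny sketch of the proposed transitions
> > >> > > > > > > > >> (illustrative names, mirroring the existing partition
> > >> > > > > > > > >> states plus the new SHRINKING):
> > >> > > > > > > > >>
> > >> > > > > > > > >>     enum PartState {
> > >> > > > > > > > >>         MOVING, OWNING, RENTING, EVICTED, LOST, SHRINKING
> > >> > > > > > > > >>     }
> > >> > > > > > > > >>
> > >> > > > > > > > >>     class Partition {
> > >> > > > > > > > >>         volatile PartState state = PartState.OWNING;
> > >> > > > > > > > >>         long updateCntr; // preserved across the whole shrink
> > >> > > > > > > > >>
> > >> > > > > > > > >>         void shrink() {
> > >> > > > > > > > >>             state = PartState.SHRINKING; // no reads/updates, counters kept
> > >> > > > > > > > >>             // 1. local "rebalance": copy live entries to a compact file
> > >> > > > > > > > >>             // 2. update indices during the copy or on completion
> > >> > > > > > > > >>             // 3. swap the partition file, then:
> > >> > > > > > > > >>             state = PartState.MOVING; // starts historical rebalance
> > >> > > > > > > > >>         }
> > >> > > > > > > > >>     }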
> > >> > > > > > > > >>
> > >> > > > > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
> > >> > > > > > > > >> alexey.goncha...@gmail.com> wrote:
> > >> > > > > > > > >>
> > >> > > > > > > > >> > Anton,
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > >> The solution which Anton suggested does not
> > >> > > > > > > > >> > > >> look easy because it will most likely
> > >> > > > > > > > >> > > >> significantly hurt performance
> > >> > > > > > > > >> > > Mostly agree here, but what drop do we expect?
> > >> > > > > > > > >> > > What price are we ready to pay?
> > >> > > > > > > > >> > > Not sure, but it seems some vendors are ready to
> > >> > > > > > > > >> > > pay, for example, a 5% drop for this.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > 5% may be a big drop for some use cases, so I
> > >> > > > > > > > >> > think we should look at how to improve performance,
> > >> > > > > > > > >> > not how to make it worse.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > >> it is hard to maintain a data structure to
> > >> > > > > > > > >> > > >> choose "page from free-list with enough space
> > >> > > > > > > > >> > > >> closest to the beginning of the file".
> > >> > > > > > > > >> > > We can just split each free-list bucket into a
> > >> > > > > > > > >> > > pair and use the first for pages in the first
> > >> > > > > > > > >> > > half of the file and the second for the rest.
> > >> > > > > > > > >> > > Only two buckets are required here since, during
> > >> > > > > > > > >> > > the file shrink, the first bucket's window will
> > >> > > > > > > > >> > > shrink too.
> > >> > > > > > > > >> > > It seems this gives us the same price on put:
> > >> > > > > > > > >> > > just use the first bucket when it is not empty.
> > >> > > > > > > > >> > > The remove price (with merge) will increase, of
> > >> > > > > > > > >> > > course.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > The compromise solution is a priority put (to the
> > >> > > > > > > > >> > > first part of the file), keeping removal as is,
> > >> > > > > > > > >> > > plus schedulable per-page migration for the rest
> > >> > > > > > > > >> > > of the data during low-activity periods - see the
> > >> > > > > > > > >> > > sketch below.
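> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > A sketch of the two-bucket selection
> > >> > > > > > > > >> > > (illustrative, one size class only):
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >     import java.util.ArrayDeque;
> > >> > > > > > > > >> > >     import java.util.Deque;
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >     class SplitBucket {
> > >> > > > > > > > >> > >         private final Deque<Long> head = new ArrayDeque<>(); // 1st half
> > >> > > > > > > > >> > >         private final Deque<Long> tail = new ArrayDeque<>(); // the rest
> > >> > > > > > > > >> > >         private long fileMidpoint; // shrinks with the file
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >         void addFreePage(long pageOff) {
> > >> > > > > > > > >> > >             (pageOff < fileMidpoint ? head : tail).push(pageOff);
> > >> > > > > > > > >> > >         }
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >         /** Same cost on put: prefer the head bucket. */
> > >> > > > > > > > >> > >         Long takeForPut() {
> > >> > > > > > > > >> > >             Long off = head.poll();
> > >> > > > > > > > >> > >             return off != null ? off : tail.poll();
> > >> > > > > > > > >> > >         }
> > >> > > > > > > > >> > >     }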
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > Free lists are large and slow by themselves, and
> > >> > > > > > > > >> > they are expensive to checkpoint and to read on
> > >> > > > > > > > >> > start, so as a long-term solution I would look into
> > >> > > > > > > > >> > removing them. Moreover, I am not sure that adding
> > >> > > > > > > > >> > yet another background process will improve
> > >> > > > > > > > >> > codebase reliability and simplicity.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > If we want to go the hard path, I would look at a
> > >> > > > > > > > >> > free-page tracking bitmap - a special bitmask page
> > >> > > > > > > > >> > where each page in an adjacent block is marked 0 if
> > >> > > > > > > > >> > it has more free space than a certain configurable
> > >> > > > > > > > >> > threshold (say, 80%), i.e. free, and 1 otherwise
> > >> > > > > > > > >> > (full). Some vendors have successfully implemented
> > >> > > > > > > > >> > this approach; it looks much more promising, but is
> > >> > > > > > > > >> > harder to implement.
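> > >> > > > > > > > >> >
> > >> > > > > > > > >> > A minimal sketch of such a tracking page, assuming
> > >> > > > > > > > >> > one bit per data page in the covered block:
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >     import java.util.BitSet;
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >     /** Bit = 1 means "full", bit = 0 means "has room". */
> > >> > > > > > > > >> >     class FreePageBitmap {
> > >> > > > > > > > >> >         private static final double FREE_THRESHOLD = 0.8; // configurable
> > >> > > > > > > > >> >         private final BitSet bits;
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >         FreePageBitmap(int pagesInBlock) {
> > >> > > > > > > > >> >             bits = new BitSet(pagesInBlock);
> > >> > > > > > > > >> >         }
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >         void onPageChanged(int pageIdx, double freeFraction) {
> > >> > > > > > > > >> >             bits.set(pageIdx, freeFraction <= FREE_THRESHOLD);
> > >> > > > > > > > >> >         }
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >         /** A cheap scan replaces the free-list lookup. */
> > >> > > > > > > > >> >         int firstPageWithRoom() {
> > >> > > > > > > > >> >             return bits.nextClearBit(0);
> > >> > > > > > > > >> >         }
> > >> > > > > > > > >> >     }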
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > --AG
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > >
> > >> > > > > > > > > Best regards,
> > >> > > > > > > > > Alexei Scherbakov
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > >
> > >> > > > > > > > Best regards,
> > >> > > > > > > > Alexei Scherbakov
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Sergey Kozlov
> > >> > GridGain Systems
> > >> > www.gridgain.com
> > >>
> > >
> >
>
