I don't know about the replicated log, but the proposal seems fine to me.

Jie/BenM, do you guys have an opinion?

On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham
<sshanmug...@twitter.com.invalid> wrote:

> +1. Aurora will hugely benefit from this change.
>
> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com>
> wrote:
>
> > Hi everyone,
> >
> > I'd like to propose adding "manual" LevelDB compaction to the
> > replicated log truncation process.
> >
> > Motivation
> >
> > Mesos Master and Aurora Scheduler use the replicated log to persist
> > information about the cluster. This log is periodically truncated to
> > prune outdated log entries. However, the replicated log storage is not
> > compacted and grows without bounds. This leads to problems like
> > synchronous failover of all master/scheduler replicas happening
> > because all of them ran out of disk space.
> >
> > The only time log storage compaction happens is during recovery.
> > Because of that, periodic failovers are required to control the
> > replicated log storage growth. But this solution is suboptimal.
> > Failovers are not instant: e.g. Aurora Scheduler needs to recover its
> > storage, which, depending on the cluster, can take several minutes.
> > During this downtime tasks cannot be (re-)scheduled and users cannot
> > interact with the service.
> >
> > Proposal
> >
> > In MESOS-184 John Sirois pointed out that our usage pattern doesn't
> > work well with LevelDB's background compaction algorithm. Fortunately,
> > LevelDB provides a way to force compaction with the DB::CompactRange()
> > method. The replicated log storage can trigger it after persisting a
> > learned TRUNCATE action and deleting the truncated log positions. The
> > compacted range will span from the previous first position of the log
> > to the new first position (the one the log was truncated up to).
> >
> > Performance impact
> >
> > Mesos Master and Aurora Scheduler have two different replicated log
> > usage profiles. For Mesos Master every registry update (agent
> > (re-)registration/marking, maintenance schedule update, etc.) induces
> > writing a complete snapshot, which, depending on the cluster size, can
> > get pretty big (in a scale-test fake cluster with 55k agents it is
> > ~15MB). Every snapshot is followed by a truncation of all previous
> > entries, which doesn't block the registrar and happens in the
> > background. In the scale-test cluster with 55k agents compactions
> > after such truncations take ~680ms.
> >
> > To reduce the performance impact on the Master, compaction can be
> > triggered only after more than a configurable number of keys have
> > been deleted.
> >
> > Aurora Scheduler writes incremental changes of its storage to the
> > replicated log. Every hour a storage snapshot is created and persisted
> > to the log, followed by a truncation of all entries preceding the
> > snapshot. Therefore, storage compactions will be infrequent but will
> > deal with a potentially large number of keys. In the scale-test
> > cluster such compactions took ~425ms each.
> >
> > Please let me know what you think about it.
> >
> > Thanks!
> >
> > --
> > Ilya Pronin
> >
>
