+1. Aurora will hugely benefit from this change.

On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com> wrote:

> Hi everyone,
>
> I'd like to propose adding "manual" LevelDB compaction to the
> replicated log truncation process.
>
> Motivation
>
> Mesos Master and Aurora Scheduler use the replicated log to persist
> information about the cluster. This log is periodically truncated to
> prune outdated log entries. However, the replicated log storage itself
> is not compacted and grows without bound. This can lead to problems
> such as a simultaneous failover of all master/scheduler replicas when
> all of them run out of disk space.
>
> The only time log storage compaction currently happens is during
> recovery. Because of that, periodic failovers are required to control
> replicated log storage growth, but this solution is suboptimal.
> Failovers are not instant: e.g. Aurora Scheduler needs to recover its
> storage, which, depending on the cluster, can take several minutes.
> During that downtime tasks cannot be (re-)scheduled and users cannot
> interact with the service.
>
> Proposal
>
> In MESOS-184 John Sirois pointed out that our usage pattern doesn't
> work well with LevelDB's background compaction algorithm. Fortunately,
> LevelDB provides a way to force compaction with the
> DB::CompactRange() method. The replicated log storage can trigger it
> after persisting a learned TRUNCATE action and deleting the truncated
> log positions. The compacted range will span from the previous first
> position of the log to the new first position (the one the log was
> truncated up to).
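>
> For illustration, here is a minimal C++ sketch of where such a hook
> could live. The encode() helper, compactTruncatedRange(), and the
> fixed-width key encoding are made up for this example; only
> leveldb::DB::CompactRange() is the real LevelDB API:
>
>   #include <cstdint>
>   #include <cstdio>
>   #include <string>
>
>   #include <leveldb/db.h>
>   #include <leveldb/slice.h>
>
>   // Hypothetical key encoding: fixed-width decimal so that
>   // lexicographic key order matches numeric position order.
>   static std::string encode(uint64_t position)
>   {
>     char key[32];
>     snprintf(key, sizeof(key), "%020llu", (unsigned long long) position);
>     return std::string(key);
>   }
>
>   // Hypothetical hook: after a learned TRUNCATE action has been
>   // persisted and positions [first, truncatedTo) have been deleted,
>   // force compaction of exactly that key range.
>   void compactTruncatedRange(leveldb::DB* db,
>                              uint64_t first,       // old first position
>                              uint64_t truncatedTo) // new first position
>   {
>     const std::string begin = encode(first);
>     const std::string end = encode(truncatedTo);
>
>     const leveldb::Slice beginSlice(begin);
>     const leveldb::Slice endSlice(end);
>
>     // Rewrites the SSTables covering [begin, end], dropping the
>     // deletion tombstones and reclaiming disk space.
>     db->CompactRange(&beginSlice, &endSlice);
>   }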
>
> Performance impact
>
> Mesos Master and Aurora Scheduler have two different replicated log
> usage profiles. For Mesos Master, every registry update (agent
> (re-)registration/marking, maintenance schedule update, etc.) induces
> writing a complete snapshot, which, depending on the cluster size, can
> get pretty big (~15MB in a fake scale-test cluster with 55k agents).
> Every snapshot is followed by a truncation of all previous entries,
> which doesn't block the registrar and happens asynchronously in the
> background. In that 55k-agent cluster, compactions after such
> truncations take ~680ms.
>
> To reduce the performance impact on the Master, compaction can be
> triggered only after more than a configurable number of keys has been
> deleted.
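>
> A minimal sketch of that gating logic, assuming the threshold comes
> from a new (hypothetical) flag and the counter lives next to the
> storage:
>
>   #include <cstdint>
>
>   // Hypothetical: accumulate deletions across truncations and pay the
>   // CompactRange() cost only once enough keys have been removed.
>   class CompactionGate
>   {
>   public:
>     explicit CompactionGate(uint64_t threshold) : threshold(threshold) {}
>
>     // Returns true (and resets the counter) when the accumulated
>     // number of deleted keys exceeds the configured threshold.
>     bool onDeleted(uint64_t keysDeleted)
>     {
>       deleted += keysDeleted;
>       if (deleted > threshold) {
>         deleted = 0;
>         return true;
>       }
>       return false;
>     }
>
>   private:
>     const uint64_t threshold; // e.g. a --compaction_threshold flag
>     uint64_t deleted = 0;
>   };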
>
> Aurora Scheduler writes incremental changes of its storage to the
> replicated log. Every hour a storage snapshot is created and persisted
> to the log, followed by a truncation of all entries preceding the
> snapshot. Therefore, storage compactions will be infrequent but will
> deal with a potentially large number of keys. In the scale-test
> cluster such compactions took ~425ms each.
>
> Please let me know what you think about it.
>
> Thanks!
>
> --
> Ilya Pronin
>
