Hi everyone,

I'd like to propose adding "manual" LevelDB compaction to the
replicated log truncation process.

Motivation

Mesos Master and Aurora Scheduler use the replicated log to persist
information about the cluster. This log is periodically truncated to
prune outdated log entries. However, the replicated log storage itself is
never compacted and grows without bound. This can lead to problems such as
a simultaneous failover of all master/scheduler replicas because all of
them ran out of disk space.

The only time log storage compaction happens is during recovery. Because of
that, periodic failovers are required to keep replicated log storage growth
under control, but this workaround is suboptimal. Failovers are not instant:
e.g. the Aurora Scheduler needs to recover its storage, which depending on
the cluster size can take several minutes. During this downtime tasks cannot
be (re-)scheduled and users cannot interact with the service.

Proposal

In MESOS-184 John Sirois pointed out that our usage pattern doesn't work
well with LevelDB's background compaction algorithm. Fortunately, LevelDB
provides a way to force compaction with the DB::CompactRange() method. The
replicated log storage can trigger it after persisting a learned TRUNCATE
action and deleting the truncated log positions. The compacted range will
span from the previous first position of the log to the new first position
(the one the log was truncated up to).
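To illustrate, here is a rough sketch of what the storage could do after a
learned TRUNCATE action has been applied and the old positions deleted. The
encode() helper and the fixed-width key encoding are assumptions for the
sake of the example, not the actual replicated log internals; only
DB::CompactRange() itself is part of the public LevelDB API.

    #include <leveldb/db.h>

    #include <cstdint>
    #include <iomanip>
    #include <sstream>
    #include <string>

    // Assumed key encoding: fixed-width decimal so that lexicographic key
    // order matches numeric position order. The real encoding may differ;
    // this is only for illustration.
    static std::string encode(uint64_t position)
    {
      std::ostringstream out;
      out << std::setw(10) << std::setfill('0') << position;
      return out.str();
    }

    // After the TRUNCATE action has been persisted and the positions in
    // [first, to) have been deleted, force LevelDB to compact exactly that
    // key range so the deleted entries are physically reclaimed.
    static void compactTruncatedRange(
        leveldb::DB* db,
        uint64_t first,  // previous first position of the log
        uint64_t to)     // new first position (truncated up to)
    {
      const std::string begin = encode(first);
      const std::string end = encode(to);

      leveldb::Slice beginSlice(begin);
      leveldb::Slice endSlice(end);

      db->CompactRange(&beginSlice, &endSlice);
    }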

Performance impact

Mesos Master and Aurora Scheduler have two different replicated log usage
profiles. For the Mesos Master every registry update (agent
(re-)registration/marking, maintenance schedule update, etc.) induces
writing a complete snapshot, which depending on the cluster size can get
pretty big (in a scale test fake cluster with 55k agents it is ~15MB). Every
snapshot is followed by a truncation of all previous entries, which doesn't
block the registrar and effectively happens in the background. In the scale
test cluster with 55k agents compactions after such truncations take ~680ms.

To reduce the performance impact on the Master, compaction can be triggered
only after more than a configurable number of keys have been deleted since
the last compaction, as sketched below.
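A minimal sketch of that throttling idea, assuming the threshold comes from
a configurable flag (the name and the class are made up for illustration):

    #include <cstdint>

    // Hypothetical throttle: only trigger a manual compaction once enough
    // keys have been deleted since the last one.
    class CompactionThrottle
    {
    public:
      explicit CompactionThrottle(uint64_t threshold)
        : threshold(threshold), deletedSinceCompaction(0) {}

      // Called after a truncation deleted `count` positions. Returns true
      // if a manual compaction should be triggered now.
      bool onDeleted(uint64_t count)
      {
        deletedSinceCompaction += count;
        if (deletedSinceCompaction < threshold) {
          return false;
        }
        deletedSinceCompaction = 0;
        return true;
      }

    private:
      const uint64_t threshold;
      uint64_t deletedSinceCompaction;
    };

This way the Master's frequent small truncations would only occasionally
pay the compaction cost.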

The Aurora Scheduler writes incremental changes of its storage to the
replicated log. Every hour a storage snapshot is created and persisted to
the log, followed by a truncation of all entries preceding the snapshot.
Therefore, storage compactions will be infrequent but will deal with a
potentially large number of keys. In the scale test cluster such compactions
took ~425ms each.

Please let me know what you think about it.

Thanks!

-- 
Ilya Pronin
