I don't know about the replicated log, but the proposal seems fine to me. Jie/BenM, do you guys have an opinion?
On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <sshanmug...@twitter.com.invalid> wrote:
> +1. Aurora will hugely benefit from this change.
>
> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com> wrote:
>
> > Hi everyone,
> >
> > I'd like to propose adding "manual" LevelDB compaction to the
> > replicated log truncation process.
> >
> > Motivation
> >
> > Mesos Master and Aurora Scheduler use the replicated log to persist
> > information about the cluster. This log is periodically truncated to
> > prune outdated log entries. However, the replicated log storage is
> > not compacted and grows without bound. This leads to problems like a
> > synchronous failover of all master/scheduler replicas happening
> > because all of them ran out of disk space.
> >
> > The only time log storage compaction happens is during recovery.
> > Because of that, periodic failovers are required to control the
> > growth of replicated log storage. But this solution is suboptimal:
> > failovers are not instant, e.g. Aurora Scheduler needs to recover
> > its storage, which depending on the cluster can take several
> > minutes. During the downtime tasks cannot be (re-)scheduled and
> > users cannot interact with the service.
> >
> > Proposal
> >
> > In MESOS-184 John Sirois pointed out that our usage pattern doesn't
> > work well with LevelDB's background compaction algorithm.
> > Fortunately, LevelDB provides a way to force compaction with the
> > DB::CompactRange() method. Replicated log storage can trigger it
> > after persisting a learned TRUNCATE action and deleting the
> > truncated log positions. The compacted range will span from the
> > previous first position of the log to the new first position (the
> > one the log was truncated up to).
> >
> > Performance impact
> >
> > Mesos Master and Aurora Scheduler have 2 different replicated log
> > usage profiles. For Mesos Master, every registry update (agent
> > (re-)registration/marking, maintenance schedule update, etc.)
> > induces writing a complete snapshot, which depending on the cluster
> > size can get pretty big (in a scale test fake cluster with 55k
> > agents it is ~15MB). Every snapshot is followed by a truncation of
> > all previous entries, which doesn't block the registrar and happens
> > more or less in the background. In the scale test cluster with 55k
> > agents, compactions after such truncations take ~680ms.
> >
> > To reduce the performance impact for the Master, compaction can be
> > triggered only after more than some configurable number of keys
> > have been deleted.
> >
> > Aurora Scheduler writes incremental changes of its storage to the
> > replicated log. Every hour a storage snapshot is created and
> > persisted to the log, followed by a truncation of all entries
> > preceding the snapshot. Therefore, storage compactions will be
> > infrequent but will deal with a potentially large number of keys.
> > In the scale test cluster such compactions took ~425ms each.
> >
> > Please let me know what you think about it.
> >
> > Thanks!
> >
> > --
> > Ilya Pronin