Hi Ben,

Yes, we do plan to work on this. Thanks a lot for raising the concern! We will change the first task to update the design doc and revive the discussion around this.

Cheers,
Artem.

On Tue, Jul 7, 2015 at 3:20 PM, Benjamin Mahler <[email protected]> wrote:

> Hm.. are you guys planning to start working on this?
>
> We should revisit the design here:
> https://docs.google.com/document/d/1CIoOnBLFiEvmhOe-h_s8M4m9Qa7BLETuj_dSNJW959U/edit
>
> Specifically, after getting feedback from folks working on storage systems,
> they seem to really want the safety of explicit acceptance of maintenance.
> This complicates things a bit, because previously we simplified this by
> relying on a lack of things running to signal implicit acceptance. Once
> explicit acceptance is required, we need to allow frameworks to
> accept maintenance even if they don't have anything running on the slave.
> Ideally, we don't ask all frameworks about all slaves, for example, by only
> asking when they have allocation rights (e.g. reservations, quota, etc).
>
> On Tue, Jul 7, 2015 at 3:03 PM, Artem Harutyunyan (JIRA) <[email protected]>
> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/MESOS-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Artem Harutyunyan updated MESOS-2075:
>> -------------------------------------
>>            Sprint: Mesosphere Sprint 14
>>            Labels: mesosphere twitter (was: twitter)
>>     Fix Version/s: 0.24.0
>>
>> > Add maintenance information to the replicated registry.
>> > -------------------------------------------------------
>> >
>> >          Key: MESOS-2075
>> >          URL: https://issues.apache.org/jira/browse/MESOS-2075
>> >      Project: Mesos
>> >   Issue Type: Task
>> >   Components: master
>> >     Reporter: Benjamin Mahler
>> >       Labels: mesosphere, twitter
>> >      Fix For: 0.24.0
>> >
>> >
>> > To achieve fault-tolerance for the maintenance primitives, we will need to add the maintenance information to the registry.
>> > The registry currently stores all of the slave information, which is quite large (~ 17MB for 50,000 slaves from my testing), which results in a protobuf object that is extremely expensive to copy.
>> > As far as I can tell, reads / writes to maintenance information are independent of reads / writes to the existing 'registry' information. So there are two approaches here:
>> > h4. Add maintenance information to 'maintenance' key:
>> > # The advantage of this approach is that we don't further grow the large Registry object.
>> > # This approach assumes that writes to 'maintenance' are independent of writes to the 'registry'. If these writes are not independent, this approach requires that we add transactional support to the State abstraction.
>> > # This approach requires adding compaction to LogStorage.
>> > # This approach likely requires some refactoring to the Registrar.
>> > h4. Add maintenance information to 'registry' key:
>> > # The advantage of this approach is that it's the easiest to implement.
>> > # This will further grow the single 'registry' object, but doesn't preclude it being split apart in the future.
>> > # This approach may require using the diff support in LogStorage and/or adding compression support to LogStorage snapshots to deal with the increased size of the registry.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
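
For illustration only, the trade-off between the two approaches in the quoted ticket can be sketched with a toy key-value store. This is a hypothetical in-memory model, not the Mesos State, LogStorage, or Registrar API: `VersionedStore`, `fetch`, `store`, and the per-key version counter are all names assumed for the sketch. It assumes writes are accepted only when the writer has seen the latest version of the key it is writing.

```python
class VersionedStore:
    """Toy key-value store with per-key optimistic versioning (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def fetch(self, key):
        # Unknown keys start at version 0 with an empty value.
        return self._data.get(key, (0, {}))

    def store(self, key, expected_version, value):
        # The write succeeds only if the key was not written since our fetch.
        current_version, _ = self.fetch(key)
        if current_version != expected_version:
            return False
        self._data[key] = (current_version + 1, value)
        return True


def demo_separate_keys():
    # Approach 1: maintenance information lives under its own key.
    store = VersionedStore()
    reg_version, registry = store.fetch("registry")
    mnt_version, maintenance = store.fetch("maintenance")
    # Two writers start from snapshots taken at the same time,
    # but they touch different keys, so neither invalidates the other.
    ok_registry = store.store(
        "registry", reg_version, {**registry, "slaves": ["s1"]})
    ok_maintenance = store.store(
        "maintenance", mnt_version, {**maintenance, "s1": "draining"})
    return ok_registry, ok_maintenance


def demo_single_key():
    # Approach 2: maintenance information is part of the 'registry' object.
    store = VersionedStore()
    version, registry = store.fetch("registry")
    ok_slaves = store.store(
        "registry", version, {**registry, "slaves": ["s1"]})
    # The second writer still holds the old version, so its write is
    # rejected and it would have to re-fetch the whole object and retry.
    ok_maintenance = store.store(
        "registry", version, {**registry, "maintenance": {"s1": "draining"}})
    return ok_slaves, ok_maintenance
```

In this model the separate-key layout lets both writes succeed, while the single-key layout forces the second writer to re-read and rewrite the full object. That only helps if the two kinds of writes really are independent, which is the assumption the ticket calls out. If they are not, the store would need some form of multi-key transaction.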
