Hi Ben,

Yes, we do plan to work on this. Thanks a lot for raising the concern!
We will change the first task to update the design doc and revive the
discussion around this.

Cheers,
Artem.

On Tue, Jul 7, 2015 at 3:20 PM, Benjamin Mahler
<[email protected]> wrote:
> Hm.. are you guys planning to start working on this?
>
> We should revisit the design here:
> https://docs.google.com/document/d/1CIoOnBLFiEvmhOe-h_s8M4m9Qa7BLETuj_dSNJW959U/edit
>
> Specifically, after getting feedback from folks working on storage systems,
> they seem to really want the safety of explicit acceptance of maintenance.
> This complicates things a bit, because previously we simplified this by
> relying on a lack of things running to signal implicit acceptance. Once
> explicit acceptance is required, we need to allow frameworks to
> accept maintenance even if they don't have anything running on the slave.
> Ideally, we don't ask all frameworks about all slaves, for example, by only
> asking when they have allocation rights (e.g. reservations, quota, etc.).
>
> On Tue, Jul 7, 2015 at 3:03 PM, Artem Harutyunyan (JIRA) <[email protected]>
> wrote:
>
>>
>>      [
>> https://issues.apache.org/jira/browse/MESOS-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Artem Harutyunyan updated MESOS-2075:
>> -------------------------------------
>>            Sprint: Mesosphere Sprint 14
>>            Labels: mesosphere twitter  (was: twitter)
>>     Fix Version/s: 0.24.0
>>
>> > Add maintenance information to the replicated registry.
>> > -------------------------------------------------------
>> >
>> >                 Key: MESOS-2075
>> >                 URL: https://issues.apache.org/jira/browse/MESOS-2075
>> >             Project: Mesos
>> >          Issue Type: Task
>> >          Components: master
>> >            Reporter: Benjamin Mahler
>> >              Labels: mesosphere, twitter
>> >             Fix For: 0.24.0
>> >
>> >
>> > To achieve fault-tolerance for the maintenance primitives, we will need
>> to add the maintenance information to the registry.
>> > The registry currently stores all of the slave information, which is
>> quite large (~ 17MB for 50,000 slaves from my testing), which results in a
>> protobuf object that is extremely expensive to copy.
>> > As far as I can tell, reads / writes to maintenance information are
>> independent of reads / writes to the existing 'registry' information. So
>> there are two approaches here:
>> > h4. Add maintenance information to 'maintenance' key:
>> > # The advantage of this approach is that we don't further grow the large
>> Registry object.
>> > # This approach assumes that writes to 'maintenance' are independent of
>> writes to the 'registry'. If these writes are not independent, this
>> approach requires that we add transactional support to the State
>> abstraction.
>> > # This approach requires adding compaction to LogStorage.
>> > # This approach likely requires some refactoring to the Registrar.
>> > h4. Add maintenance information to 'registry' key:
>> > # The advantage of this approach is that it's the easiest to implement.
>> > # This will further grow the single 'registry' object, but doesn't
>> preclude it being split apart in the future.
>> > # This approach may require using the diff support in LogStorage and/or
>> adding compression support to LogStorage snapshots to deal with the
>> increased size of the registry.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
