[
https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588615#comment-13588615
]
Benjamin Hindman commented on MESOS-354:
----------------------------------------
Thanks for writing this out, Brian.
I'd prefer to keep things as simple as possible, i.e., a boolean flag per offer
indicating whether or not the resources within the offer are revocable. This
also captures the design we (Berkeley) had discussed in the past.
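As a rough sketch of what the scheduler side could look like (assuming a
hypothetical Offer::revocable() accessor for the new flag, plus hypothetical
launchBatchTasks/launchServiceTasks helpers):

    #include <vector>

    #include <mesos/scheduler.hpp>

    using namespace mesos;

    // Assumes a hypothetical `revocable` boolean on Offer; everything in a
    // revocable offer can be taken back by the master at any time.
    void MyScheduler::resourceOffers(SchedulerDriver* driver,
                                     const std::vector<Offer>& offers)
    {
      for (const Offer& offer : offers) {
        if (offer.revocable()) {
          // Only place preemptible work (e.g. batch tasks) here.
          launchBatchTasks(driver, offer);  // hypothetical helper
        } else {
          launchServiceTasks(driver, offer);  // hypothetical helper
        }
      }
    }
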
It is true that the current semantics are to only have one offer per slave per
offers callback, but I don't think we need to adhere to this going forward.
That is, I think it's perfectly reasonable to have two offers for the same
slave in the list of offers where one is for revocable resources and one is for
non-revocable resources. A long-standing desire is to enable schedulers to
aggregate offers on the same slave. One could imagine the aggregate only being
tainted/revocable if one of the offers contains revocable resources. Enabling
aggregate offers might be a great starter project actually (I think there is a
JIRA out there for this).
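To make the "tainting" concrete, a minimal sketch of the aggregation (again
assuming the hypothetical Offer::revocable() flag from above):

    #include <map>
    #include <string>
    #include <vector>

    #include <mesos/mesos.hpp>

    using namespace mesos;

    struct Aggregate {
      std::vector<Offer> offers;
      bool revocable = false;  // tainted if any member offer is revocable
    };

    // Group offers by slave; the aggregate is only revocable/tainted if one
    // of its constituent offers contains revocable resources.
    std::map<std::string, Aggregate> aggregateBySlave(
        const std::vector<Offer>& offers)
    {
      std::map<std::string, Aggregate> aggregates;
      for (const Offer& offer : offers) {
        Aggregate& aggregate = aggregates[offer.slave_id().value()];
        aggregate.offers.push_back(offer);
        aggregate.revocable = aggregate.revocable || offer.revocable();
      }
      return aggregates;
    }
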
Also, I'm not keen on handcuffing the allocator with the semantics that
non-revocable resources will never be revoked. That is to say, I can imagine
sophisticated allocators "reallocating" resources amongst frameworks in order
to "defrag" the cluster for better utilization, to turn off machines, or enable
running more tasks. We've always played with the idea of masking these
revocations as machine failures (i.e., TASK_LOST), assuming that more resources
will be allocated to the framework ASAP. But we might be able to capture this
more explicitly. For example, one could imagine a "reallocated" callback that
offers resources to replace what was revoked. I'm all ears if you have ideas
for better capturing these semantics via the API.
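One hedged sketch of what such a (purely hypothetical) callback might look
like on the Scheduler interface:

    // Hypothetical `reallocated` callback: pairs the revoked tasks with
    // replacement offers so the framework can relaunch immediately rather
    // than inferring loss from TASK_LOST.
    void MyScheduler::reallocated(SchedulerDriver* driver,
                                  const std::vector<TaskID>& revokedTasks,
                                  const std::vector<Offer>& replacementOffers)
    {
      for (const Offer& offer : replacementOffers) {
        // rebuildTasks() is a hypothetical framework-side helper that maps
        // the revoked TaskIDs onto new TaskInfos sized to fit this offer.
        std::vector<TaskInfo> tasks = rebuildTasks(revokedTasks, offer);
        driver->launchTasks(offer.id(), tasks);
      }
    }
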
Finally (and related to the above), in conjunction with revocation (not so
much oversubscription) I'd like to introduce "inverse offers": a request for
the scheduler to kill its own tasks in order to free up resources in the
cluster.
Like other things in Mesos, this enables the scheduler to be involved in the
process if it wants to be (if it doesn't, the system will just decide what to
revoke). I'll attach a poster I had previously created with a lot of these
ideas.
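One possible shape for such a callback (everything here is hypothetical,
sketched only to show the scheduler staying in the loop):

    // Hypothetical `inverseOffer` callback: the master asks the scheduler
    // to free up `requested` resources on `slaveId`; the scheduler chooses
    // the victims (or ignores the request and lets the system revoke).
    void MyScheduler::inverseOffer(SchedulerDriver* driver,
                                   const SlaveID& slaveId,
                                   const Resources& requested)
    {
      // selectVictims() is a hypothetical framework-side policy, e.g. kill
      // the lowest-priority batch tasks first until `requested` is covered.
      for (const TaskID& taskId : selectVictims(slaveId, requested)) {
        driver->killTask(taskId);
      }
    }
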
Note that it's not clear we need/should design all these bits in order to get
to oversubscription. Just adding the revocable boolean might be sufficient for
now. I just want to make sure that we don't walk ourselves into a corner where
some of these other features/mechanisms will become very tedious to introduce.
> oversubscribe resources
> -----------------------
>
> Key: MESOS-354
> URL: https://issues.apache.org/jira/browse/MESOS-354
> Project: Mesos
> Issue Type: New Feature
> Components: isolation, master, slave
> Reporter: brian wickman
> Priority: Minor
>
> This proposal is predicated upon offer revocation.
> The idea would be to add a new "revoked" status either by (1) piggybacking
> off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a
> new status update TASK_REVOKED.
> In order to augment an offer with metadata about revocability, there are
> options:
> 1) Add a revocable boolean to the Offer and
> a) offer only one type of Offer per slave at a particular time
> b) offer both revocable and non-revocable resources at the same time but
> require frameworks to understand that Offers can contain overlapping resources
> 2) Add a revocable_resources field on the Offer which is a superset of the
> regular resources field. By consuming more than resources but at most
> revocable_resources in a launchTasks call, the Task becomes a revocable
> task. If launching a task with at most resources, the Task is
> non-revocable.
> The use cases for revocable tasks are batch workloads (e.g.
> hadoop/pig/mapreduce); non-revocable tasks are online, higher-SLA workloads
> (e.g. services).
> Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of
> disk. One of these resources is a rate (4 CPU-seconds per second) and two
> are fixed values (8 GB and 20 GB respectively, though disk resources can be
> further broken down into spindles - fixed - and iops - a rate). In practice,
> these are the maximum resources in the respective dimensions that this task
> will use. In reality, we provision tasks at some factor below peak, and only
> hit peak resource consumption in rare circumstances or perhaps at a diurnal
> peak.
> In the meantime, we stand to gain from offering some constant factor of
> the difference (reserved - actual) of non-revocable tasks as
> revocable resources, depending upon our tolerance for revocable task churn.
> The main challenge is coming up with an accurate short / medium / long-term
> prediction of resource consumption based upon current behavior.
> In many cases it would be OK to be sloppy:
> * CPU / iops / network IO are rates (compressible) and can often tolerate
> running below guarantees for brief periods while task revocation takes place
> * Memory slack can be provided by enabling swap and dynamically setting
> swap paging boundaries. Should swap ever be activated, that would be a
> signal to revoke.
> The master / allocator would piggyback on the slave heartbeat mechanism to
> learn of the amount of revocable resources available at any point in time.