[ 
https://issues.apache.org/jira/browse/MESOS-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813443#comment-15813443
 ] 

Benjamin Mahler commented on MESOS-6596:
----------------------------------------

[~mcypark] [~zhitao] The plan is to move offer ownership into the allocator
(see MESOS-4553). Once the allocator owns the offers, it can apply an
operation by invalidating whichever outstanding offers contain the needed
resources. This would remove the hack in the master and avoid the race.

Note that even with this change, applying a dynamic reservation can still
race when the resources are no longer available on the agent (i.e.
tasks/executors are using them). This race will be mitigated by avoiding the
allocator backups, which is currently being worked on as part of the batching
effort (see the patches in MESOS-3157, but note that we will file a new ticket
that captures what is being done, since it is not what MESOS-3157 asked for).
If this turns out not to be enough, we can discuss further techniques (e.g.
adding an operation timeout to wait for available resources, maintenance
integration to drain enough resources, etc.).
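
Until a fix along these lines lands, a caller can work around the intermittent
409s by retrying the same request for a bounded time. The following is only a
client-side sketch, not the master-side operation timeout mentioned above. It
assumes a hypothetical caller-supplied function, send_reserve, that issues
POST /master/reserve and returns the HTTP status code; the function name, the
timeout, and the delay values are illustrative and not part of Mesos.

```python
import time
from typing import Callable


def reserve_with_retry(
    send_reserve: Callable[[], int],
    timeout_secs: float = 30.0,
    initial_delay_secs: float = 0.5,
    max_delay_secs: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry a reserve request while the master answers 409 Conflict.

    send_reserve is a hypothetical callable that performs the POST and
    returns the HTTP status code. Returns True on 202 Accepted and False
    if the deadline would pass before the next attempt. Any other status
    raises, since retrying would not help.
    """
    deadline = clock() + timeout_secs
    delay = initial_delay_secs
    while True:
        status = send_reserve()
        if status == 202:
            return True
        if status != 409:
            raise RuntimeError(f"unexpected status {status} from /master/reserve")
        # 409: the resources are currently offered out or in use, so wait
        # and try again, unless the next wait would cross the deadline.
        if clock() + delay > deadline:
            return False
        sleep(delay)
        # Exponential backoff, capped so retries stay reasonably frequent.
        delay = min(delay * 2, max_delay_secs)
```

Passing the sender, sleep, and clock as parameters keeps the loop independent
of any particular HTTP library and makes it testable without a running master.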

> Dynamic reservation endpoint returns 409s
> -----------------------------------------
>
>                 Key: MESOS-6596
>                 URL: https://issues.apache.org/jira/browse/MESOS-6596
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Kunal Thakar
>
> The operation to dynamically reserve a host for a framework fails most of
> the time, but occasionally succeeds.
> We are calling the /reserve endpoint on the master with the same payload
> each time. It mostly returns 409, with the occasional success. Here is the
> output of two consecutive /reserve calls:
> {code}
> * About to connect() to computexxx-yyy port 5050 (#0)
> *   Trying 10.184.21.3... connected
> * Server auth using Basic with user 'cassandra'
> > POST /master/reserve HTTP/1.1
> > Authorization: Basic blah
> > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.2j 
> > zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> > Host: computexxx-yyy:5050
> > Accept: */*
> > Content-Length: 1046
> > Content-Type: application/x-www-form-urlencoded
> > Expect: 100-continue
> >
> * Done waiting for 100-continue
> < HTTP/1.1 409 Conflict
> HTTP/1.1 409 Conflict
> < Date: Tue, 15 Nov 2016 23:07:10 GMT
> Date: Tue, 15 Nov 2016 23:07:10 GMT
> < Content-Type: text/plain; charset=utf-8
> Content-Type: text/plain; charset=utf-8
> < Content-Length: 58
> Content-Length: 58
> * HTTP error before end of send, stop sending
> <
> * Closing connection #0
> Invalid RESERVE Operation:  does not contain mem(*):120621
> {code}
> {code}
> * About to connect() to computexxx-yyy port 5050 (#0)
> *   Trying 10.184.21.3... connected
> * Server auth using Basic with user 'cassandra'
> > POST /master/reserve HTTP/1.1
> > Authorization: Basic blah
> > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.2j 
> > zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> > Host: computexxx-yyy:5050
> > Accept: */*
> > Content-Length: 1046
> > Content-Type: application/x-www-form-urlencoded
> > Expect: 100-continue
> >
> * Done waiting for 100-continue
> < HTTP/1.1 202 Accepted
> HTTP/1.1 202 Accepted
> < Date: Tue, 15 Nov 2016 23:07:16 GMT
> Date: Tue, 15 Nov 2016 23:07:16 GMT
> < Content-Length: 0
> Content-Length: 0
> {code}



