[ https://issues.apache.org/jira/browse/MESOS-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806354#comment-15806354 ]
Michael Park commented on MESOS-6596:
-------------------------------------
[~zhitao] What is the {{allocation_interval}} for the cluster, and how many
frameworks are in play?
I think [~kaysoky] is right in that you are indeed running into the
{{allocate}} vs {{updateAvailable}} race.
We initially tried to work around the issue in practice with this piece of
code:
https://github.com/apache/mesos/blob/1.1.0/src%2Fmaster%2Fhttp.cpp#L4599-L4606
That was a hack to begin with, and it seems it's not good enough in practice
because the {{Filter}} is only applied to the specific framework, so the
resources can still be offered to other frameworks.
There have been thoughts about making the master/allocator have a much closer
relationship, but I think that's a much bigger undertaking.
Meanwhile, I think we could consider something like adding a call to the
allocator that asks it to leave room for the specified resources,
so that the batch {{allocate}} doesn't hand out all of the resources before
the {{updateAvailable}} call gets processed by the allocator.
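
To make the race and the proposed "leave room" call concrete, here is a toy
Python model. It is only a sketch and not Mesos code: {{ToyAllocator}},
{{recover}}, {{hold_back}}, {{update_available}} and {{reserve}} are
hypothetical names, resources are reduced to a single integer amount of memory
on one agent, and the ordering of the race is forced rather than left to
scheduling.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAllocator:
    """Toy stand-in for an allocator (hypothetical, not the Mesos API)."""
    available: int = 0                            # unallocated mem on one agent
    held_back: int = 0                            # mem the allocator must not offer
    filtered: set = field(default_factory=set)    # frameworks with a refuse filter
    offers: dict = field(default_factory=dict)    # framework -> offered mem

    def recover(self, framework, amount):
        """Take back a rescinded offer and filter only that framework."""
        self.offers[framework] = self.offers.get(framework, 0) - amount
        self.available += amount
        self.filtered.add(framework)

    def hold_back(self, amount):
        """Hypothetical 'leave room' call sketched from the proposal above."""
        self.held_back += amount

    def allocate(self, frameworks):
        """Batch allocation: offer all offerable mem to the first unfiltered framework."""
        offerable = self.available - self.held_back
        for framework in frameworks:
            if framework not in self.filtered and offerable > 0:
                self.offers[framework] = self.offers.get(framework, 0) + offerable
                self.available -= offerable
                return

    def update_available(self, amount):
        """Apply the reservation; False plays the role of the 409 response."""
        if self.available < amount:
            return False
        self.available -= amount
        self.held_back = max(0, self.held_back - amount)
        return True


def reserve(frameworks, amount, leave_room):
    """Run one reserve attempt in which allocate() wins the race."""
    allocator = ToyAllocator(offers={"A": amount})
    allocator.recover("A", amount)       # rescind A's offer, filter A only
    if leave_room:
        allocator.hold_back(amount)      # ask the allocator to keep room
    allocator.allocate(frameworks)       # batch allocate runs first
    return allocator.update_available(amount)


if __name__ == "__main__":
    print(reserve(["A"], 120621, leave_room=False))
    print(reserve(["A", "B"], 120621, leave_room=False))
    print(reserve(["A", "B"], 120621, leave_room=True))
```

In this model the per-framework filter is enough when only framework A exists,
fails as soon as a second framework B can receive the recovered memory, and the
reservation succeeds again once the allocator is asked to hold the memory back.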
> Dynamic reservation endpoint returns 409s
> -----------------------------------------
>
> Key: MESOS-6596
> URL: https://issues.apache.org/jira/browse/MESOS-6596
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Kunal Thakar
>
> The operation to dynamically reserve a host for a framework fails most of the
> time, but occasionally succeeds.
> We are calling the /reserve endpoint on the master with the same payload and
> it mostly returns 409, with the occasional success. Pasting the output of two
> consecutive /reserve calls:
> {code}
> * About to connect() to computexxx-yyy port 5050 (#0)
> * Trying 10.184.21.3... connected
> * Server auth using Basic with user 'cassandra'
> > POST /master/reserve HTTP/1.1
> > Authorization: Basic blah
> > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.2j
> > zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> > Host: computexxx-yyy:5050
> > Accept: */*
> > Content-Length: 1046
> > Content-Type: application/x-www-form-urlencoded
> > Expect: 100-continue
> >
> * Done waiting for 100-continue
> < HTTP/1.1 409 Conflict
> HTTP/1.1 409 Conflict
> < Date: Tue, 15 Nov 2016 23:07:10 GMT
> Date: Tue, 15 Nov 2016 23:07:10 GMT
> < Content-Type: text/plain; charset=utf-8
> Content-Type: text/plain; charset=utf-8
> < Content-Length: 58
> Content-Length: 58
> * HTTP error before end of send, stop sending
> <
> * Closing connection #0
> Invalid RESERVE Operation: does not contain mem(*):120621
> {code}
> {code}
> * About to connect() to computexxx-yyy port 5050 (#0)
> * Trying 10.184.21.3... connected
> * Server auth using Basic with user 'cassandra'
> > POST /master/reserve HTTP/1.1
> > Authorization: Basic blah
> > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.2j
> > zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> > Host: computexxx-yyy:5050
> > Accept: */*
> > Content-Length: 1046
> > Content-Type: application/x-www-form-urlencoded
> > Expect: 100-continue
> >
> * Done waiting for 100-continue
> < HTTP/1.1 202 Accepted
> HTTP/1.1 202 Accepted
> < Date: Tue, 15 Nov 2016 23:07:16 GMT
> Date: Tue, 15 Nov 2016 23:07:16 GMT
> < Content-Length: 0
> Content-Length: 0
> {code}