[ 
https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353855#comment-17353855
 ] 

Charles Natali commented on MESOS-10221:
----------------------------------------

Hey [~934341445],

Receiving {{TASK_LOST}} upon a stale offer is perfectly fine and can occur in a 
normal and healthy cluster, and so should therefore be handled by the framework.

Here's an annotated log:
{noformat}
# the master sends an offer to the framework
I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
24b62b35-26d6-4a13-ba75-
d84ce5fed64e-0005 (Test HTTP Framework)
# the master removes the offer: from that point, it is not valid, and any task 
submitted against it will be rejected with TASK_LOST
I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
# here the master receives an ACCEPT from the framework using this offer, which 
isn't valid anymore
I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
/master/api/v1/scheduler from 10.118.28.66:50484 with 
User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
# and therefore rejects it
W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers '[ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
{noformat}
 

However one thing which I notice from the above log is that there is a 24s gap 
between the master removing the offer (at 15:10:33.386322) and the framework 
trying to accept it (at 15:10:57.181581): normally, the master should have sent 
a {{RESCIND}} to the framework when the offer was removed (see 
[http://mesos.apache.org/documentation/latest/scheduler-http-api/#rescind]).
 Does your framework handle RESCIND? If not, this would make such rejections 
with {{TASK_LOST}} much more frequent than if it did.

Also, do you know what triggered the offer to be removed? One common cause is 
if an agent is disconnected for example, does that happen a lot in your cluster?
 What happened in this specific example, I'm surprised to not see more context 
in the log, did you filter out some lines?

> A large number of TASK_LOST causes the task to be unable to run
> ---------------------------------------------------------------
>
>                 Key: MESOS-10221
>                 URL: https://issues.apache.org/jira/browse/MESOS-10221
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.9.0, 1.11.0
>         Environment: Ubuntu 16.04
>            Reporter: clancyhuang
>            Priority: Major
>
> Recently, we found that the mesos master frequently generates Task lost 
> exceptions after task submission, and retrying in a short period of time is 
> not feasible, and it is becoming more and more frequent.
>  We selected two abnormal logs
> {code:java}
> I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 
> seconds filter
> I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers 
> '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
> for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
> Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
> {code}
> The following is a log of normal execution
> {code:java}
> I0528 15:17:03.690855   959 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.742848   970 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:17:03.745221   970 master.cpp:4356] Processing ACCEPT call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent 
> cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for 
> framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.745889   970 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
> {code}
> We found that the offer was cancelled before accept when the exception 
> occurred,and the interval time is just the configured offer-timeout. Our 
> framework communicates with mesos based on http, I am sure that he sends the 
> accept message immediately after receiving the offer and the request is 
> successful.
>  The question is why sometimes the master processes the accept message after 
> the offer times out. In addition, we tried to increase the offer-timeout, but 
> the problem was not resolved



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to