[ https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353918#comment-17353918 ]
clancyhuang commented on MESOS-10221: ------------------------------------- Yes, I chose some of the logs, because there are also logs of many runnable tasks, if you need more context information, I can provide it. In addition, according to the framework running log, the accept information is sent immediately after the offer is received, but the accept information in the master log is far behind the send offer, so is it that the accept has not been processed immediately, or is it that I have a wrong understanding of the time of the send offer. As for what causes an offer to be removed, I suspect it is the offer timeout mechanism. From the log, the time interval between removing an offer and sending an offer is just 30 seconds. This is exactly the offer timeout configuration that we set. However, we tried to increase it to 1 minute, but the same problem still occurred. This time, the interval became 1 minute. About RESCIND, we didn't deal with it. I will try to find out the cause of the problem after adding handler of rescind. Thank you for your help. > A large number of TASK_LOST causes the task to be unable to run > --------------------------------------------------------------- > > Key: MESOS-10221 > URL: https://issues.apache.org/jira/browse/MESOS-10221 > Project: Mesos > Issue Type: Bug > Components: master > Affects Versions: 1.9.0, 1.11.0 > Environment: Ubuntu 16.04 > Reporter: clancyhuang > Priority: Major > > Recently, we found that the mesos master frequently generates Task lost > exceptions after task submission, and retrying in a short period of time is > not feasible, and it is becoming more and more frequent. > We selected two abnormal logs > {code:java} > I0528 15:09:55.367336 964 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:25.369561 969 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 > I0528 15:10:43.383028 959 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > I0528 15:10:43.383656 959 master.cpp:5434] Processing DECLINE call for > offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 > seconds filter > I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:33.386322 972 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 > I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > W0528 15:10:57.183194 967 master.cpp:3959] Ignoring accept of offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid > W0528 15:10:57.183265 967 master.cpp:3964] ACCEPT call used invalid offers > '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid > I0528 15:10:57.184392 967 master.cpp:8212] Sending status update TASK_LOST > for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: > Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid' > {code} > The following is a log of normal execution > {code:java} > I0528 15:17:03.690855 959 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:17:03.742848 970 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > I0528 15:17:03.745221 970 master.cpp:4356] Processing ACCEPT call for > offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent > cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for > framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:17:03.745889 970 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 > {code} > We found that the offer was cancelled before accept when the exception > occurred,and the interval time is just the configured offer-timeout. Our > framework communicates with mesos based on http, I am sure that he sends the > accept message immediately after receiving the offer and the request is > successful. > The question is why sometimes the master processes the accept message after > the offer times out. In addition, we tried to increase the offer-timeout, but > the problem was not resolved -- This message was sent by Atlassian Jira (v8.3.4#803005)