Re: Question on status update retry in agent

2018-04-18 Thread Benjamin Mahler
I'm not following what the bug is. The code you pointed to is called from here: https://github.com/apache/mesos/blob/1.4.0/src/slave/status_update_manager.cpp#L762-L776, where we ignore duplicates and also ensure that the ack matches the latest update we've sent. So, from the code you pointed to
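
For readers of the archive, the check described above (ignore duplicate acks, and require the ack to match the latest update sent) could be sketched roughly as follows. This is an illustrative model only, not the Mesos implementation: the function name, its parameters, and the returned labels are all hypothetical.

```python
def check_acknowledgement(acknowledged, latest_sent, ack_uuid):
    """Classify an incoming acknowledgement (hypothetical helper).

    acknowledged: set of update UUIDs that were already acknowledged
    latest_sent:  UUID of the most recent update forwarded, or None
    ack_uuid:     UUID carried by the incoming acknowledgement
    """
    if ack_uuid in acknowledged:
        # Already processed: ignore it (the real agent logs a warning).
        return "duplicate"
    if ack_uuid != latest_sent:
        # Does not refer to the update currently awaiting an ack.
        return "mismatch"
    # Matches the latest update sent: record it as acknowledged.
    acknowledged.add(ack_uuid)
    return "accepted"
```

Under this model, a second ack for the same UUID is classified as a duplicate and leaves the state unchanged.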

Re: Question on status update retry in agent

2018-04-16 Thread Varun Gupta
We use explicit acks from the scheduler. Here is a snippet of the logs. Please see the logs for status update UUID a918f5ed-a604-415a-ad62-5a34fb6334ef: W0416 00:41:25.843505 124530 status_update_manager.cpp:761] Duplicate status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for

Re: Question on status update retry in agent

2018-04-10 Thread Benjamin Mahler
Do you have logs? Which acknowledgements did the agent receive? Which TASK_RUNNING in the sequence was it re-sending? On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler wrote: > > The issue is that the *old executor reference is held by the slave* (assuming it > did not receive

Re: Question on status update retry in agent

2018-04-10 Thread Benjamin Mahler
> The issue is that the *old executor reference is held by the slave* (assuming it did not receive an acknowledgement, whereas the master and scheduler have processed the status updates), so it continues to retry TASK_RUNNING indefinitely. The agent only retries so long as it does not get an acknowledgement, is the

Re: Question on status update retry in agent

2018-04-09 Thread Varun Gupta
Hi, We are running into an issue with the slave's status update manager. Below is the behavior I am seeing. Our use case: we run a stateful container (a Cassandra process), where the executor polls the JMX port at a 60-second interval to get the Cassandra state and sends that state to agent -> master -> framework.

Re: Question on status update retry in agent

2018-03-16 Thread Benjamin Mahler
(1) Assuming you're referring to the scheduler's acknowledgement of a status update, the agent will not forward TS2 until TS1 has been acknowledged. So, TS2 will not be acknowledged before TS1 is acknowledged. FWICT, we'll ignore any violation of this ordering and log a warning. (2) To reverse
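
The ordering rule in (1) can be pictured with a small sketch: only the oldest unacknowledged update is forwarded, and an ack for anything else is ignored with a warning. This is a simplified model under stated assumptions, not the agent's actual code; the class `StatusUpdateStream` and its methods are hypothetical names.

```python
from collections import deque
import uuid


class StatusUpdateStream:
    """Hypothetical per-task stream of status updates."""

    def __init__(self):
        self.pending = deque()  # unacknowledged updates, oldest first
        self.warnings = []      # stands in for the agent's warning log

    def enqueue(self, state):
        # Each update gets its own UUID, as in the logs quoted in this thread.
        update_id = str(uuid.uuid4())
        self.pending.append((update_id, state))
        return update_id

    def next_to_forward(self):
        # Only the oldest unacknowledged update is forwarded (or retried),
        # so TS2 is not sent until TS1 has been acknowledged.
        return self.pending[0] if self.pending else None

    def acknowledge(self, update_id):
        # An ack that does not match the update in flight is ignored
        # and a warning is recorded.
        if not self.pending or self.pending[0][0] != update_id:
            self.warnings.append(f"ignoring unexpected ack {update_id}")
            return False
        self.pending.popleft()
        return True
```

With two queued updates TS1 and TS2, `next_to_forward()` keeps returning TS1 until `acknowledge()` is called with TS1's UUID; an early ack for TS2 changes nothing.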

Question on status update retry in agent

2018-03-15 Thread Zhitao Li
Hi, While designing the correct behavior for one of our frameworks, we encountered some questions about the behavior of status updates: The executor continuously polls the workload probe to get the current mode of the workload (a Cassandra server), and sends various status update states (STARTING, RUNNING,