Hi,

We have a framework that launches Spark jobs on our Mesos cluster. We are currently hitting an issue where Spark jobs get stuck because of a timeout issue. We have cancel functionality that sends a killTask request to the master (sketched below). When a job gets stuck, the Spark driver task is not killed, even though the agent on the node where the driver is running receives the kill request. Is there a timeout I can set so that the Mesos agent force-kills the task in this scenario? I really appreciate your help.
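In case a concrete sketch helps frame the question, this is roughly our cancel path, plus the kind of per-task escalation timeout we are asking about. This assumes the Mesos Java bindings and the Mesos 1.0+ KillPolicy protobufs; the class name, the driver handle, the task ID value, and the 30-second grace period are placeholders for illustration, not our actual framework code:

    import java.util.concurrent.TimeUnit;
    import org.apache.mesos.Protos;
    import org.apache.mesos.SchedulerDriver;

    public class CancelSketch {

        // Ask the master to kill a task; the master forwards the request
        // to the agent where the task is running (the "Asked to kill
        // task ..." line in the agent log below).
        static void cancel(SchedulerDriver driver, String taskIdValue) {
            Protos.TaskID taskId = Protos.TaskID.newBuilder()
                    .setValue(taskIdValue)   // e.g. "79668.0.0"
                    .build();
            driver.killTask(taskId);
        }

        // The kind of timeout we are looking for: Mesos 1.0+ protobufs
        // expose TaskInfo.kill_policy, whose grace_period bounds how long
        // the agent waits after the initial kill before escalating. The
        // 30 seconds here is an arbitrary example value.
        static Protos.TaskInfo.Builder withKillGracePeriod(
                Protos.TaskInfo.Builder task, long seconds) {
            return task.setKillPolicy(Protos.KillPolicy.newBuilder()
                    .setGracePeriod(Protos.DurationInfo.newBuilder()
                            .setNanoseconds(TimeUnit.SECONDS.toNanos(seconds))));
        }
    }

Is a kill policy like this (or an agent-side setting) the right mechanism here, or is there something else we should be configuring?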
Thanks,
Venkat

Log entry from the agent logs:

I0404 03:44:47.367276 55066 slave.cpp:2035] Asked to kill task 79668.0.0 of framework 35e600c2-6f43-402c-856f-9084c0040187-002