Hi All,

I need help with a Mesos issue. I am currently working on the Myriad HA implementation.

Currently we launch the NodeManager as a Mesos task. With fine-grained scaling, we also launch YARN containers as Mesos tasks. I learned from Santosh that in order to launch tasks on the same executor, the ExecutorInfo object has to be exactly the same. The ExecutorInfo object now contains the command line for launching the NodeManager (after the NM + Executor merge), so I save the ExecutorInfo, and this works fine as long as the Scheduler + RM is not restarted.

To make sure that fine-grained scaling continues to work after a Scheduler + RM restart, I persist the ExecutorInfo object to the StateStore and retrieve it on recovery (a pull request for this is out for review). I have verified that this storage and retrieval works correctly.

When the RM + Scheduler restarts and recovers (reads state from the state store, NMs reconnect to the RM, etc.), I run a job. The YARN side of things works fine and the job succeeds. However, I don't see the MyriadExecutor get any calls to the launchTask method. In fact, when the MyriadExecutor tries to send status updates for running containers, I see the following messages in the mesos-slave.INFO log:

I0821 18:11:35.606885 27735 slave.cpp:2671] Handling status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 from executor(1)@10.10.101.138:36136
W0821 18:11:35.607535 27735 slave.cpp:2715] *Could not find the executor for status update* TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006
I0821 18:11:35.608109 27757 status_update_manager.cpp:322] Received status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006
I0821 18:11:35.608927 27735 slave.cpp:2926] Forwarding the update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 to [email protected]:5050
I0821 18:11:35.609191 27735 slave.cpp:2856] Sending acknowledgement for status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 to executor(1)@10.10.101.138:36136
I0821 18:11:35.613358 27750 status_update_manager.cpp:394] Received status update acknowledgement (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006

I believe Mesos is not creating a Mesos task for these YARN containers: I don't see the yarn_container_<container_id> tasks in the Mesos UI after the Scheduler + RM restart, whereas before the restart I do see them.

I am trying to understand whether there is anything I may be missing that causes the executor not to receive the launchTask call after a Scheduler + RM restart.

Regards
Swapnil
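
P.S. Since tasks only land on the same executor when the ExecutorInfo is exactly the same, one way to double-check the stored copy is a field-by-field diff of the ExecutorInfo before and after the restart. Below is a minimal sketch, assuming both objects have been dumped to nested dicts by some means. The function name diff_executor_info and all sample values are hypothetical and only for illustration; they are not taken from Myriad or Mesos.

```python
def diff_executor_info(saved, restored, prefix=""):
    """Return the dotted paths at which two nested dicts differ."""
    diffs = []
    # Walk the union of keys so fields missing on either side are reported.
    for key in sorted(set(saved) | set(restored)):
        path = f"{prefix}{key}"
        a, b = saved.get(key), restored.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested messages such as "command".
            diffs.extend(diff_executor_info(a, b, path + "."))
        elif a != b:
            diffs.append(path)
    return diffs


# Hypothetical dumps: the restored copy has an extra space in the command.
saved = {
    "executor_id": {"value": "myriad_executor"},
    "command": {"value": "launch-nodemanager --flag-a"},
    "resources": [{"name": "cpus", "scalar": 0.2}],
}
restored = {
    "executor_id": {"value": "myriad_executor"},
    "command": {"value": "launch-nodemanager  --flag-a"},
    "resources": [{"name": "cpus", "scalar": 0.2}],
}

print(diff_executor_info(saved, restored))  # ['command.value']
```

An empty list would suggest the two copies agree on every dumped field, in which case the cause is more likely elsewhere.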
