Hi All,

I need help with a Mesos issue that I hit while working on the Myriad HA
implementation.

Currently we launch the NodeManager as a Mesos task. With fine-grained scaling,
we also launch YARN containers as Mesos tasks. I learned from Santosh that, in
order to launch tasks on the same executor, the ExecutorInfo object has to be
exactly the same. The ExecutorInfo object now contains the command line for
launching the NodeManager (after the NM + Executor merge), so I save the
ExecutorInfo, and this works fine as long as the Scheduler + RM is not
restarted.

To make sure that fine-grained scaling continues to work after a Scheduler + RM
restart, I persist the ExecutorInfo object in the StateStore and retrieve it on
recovery (the pull request for this is out for review). I have verified that
this storage and retrieval works correctly.
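
Since the ExecutorInfo has to match exactly, one way to double-check the round trip is a field-by-field diff of the object before saving and after restoring. The sketch below is only an illustration under assumptions: `ExecutorSpec` is a hypothetical plain-Java stand-in that models a few ExecutorInfo fields (executor ID, framework ID, command, resources), not the Mesos protobuf class, and `diff` and all sample values are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

final class ExecutorInfoDiff {

    // Hypothetical stand-in for a few ExecutorInfo fields; NOT the Mesos protobuf class.
    static final class ExecutorSpec {
        final String executorId;
        final String frameworkId;
        final String command;
        final Map<String, Double> resources;

        ExecutorSpec(String executorId, String frameworkId, String command,
                     Map<String, Double> resources) {
            this.executorId = executorId;
            this.frameworkId = frameworkId;
            this.command = command;
            // Copy into a TreeMap so resource names print in a stable order.
            this.resources = new TreeMap<>(resources);
        }
    }

    // Returns the names of the fields that differ (empty list if the specs are identical).
    static List<String> diff(ExecutorSpec saved, ExecutorSpec restored) {
        List<String> changed = new ArrayList<>();
        if (!Objects.equals(saved.executorId, restored.executorId)) changed.add("executorId");
        if (!Objects.equals(saved.frameworkId, restored.frameworkId)) changed.add("frameworkId");
        if (!Objects.equals(saved.command, restored.command)) changed.add("command");
        if (!Objects.equals(saved.resources, restored.resources)) changed.add("resources");
        return changed;
    }

    public static void main(String[] args) {
        // Placeholder values, chosen only to show one mismatching field.
        ExecutorSpec saved = new ExecutorSpec(
            "example-executor", "example-framework", "launch-nm --flag=a", Map.of("cpus", 0.2));
        ExecutorSpec restored = new ExecutorSpec(
            "example-executor", "example-framework", "launch-nm --flag=b", Map.of("cpus", 0.2));
        System.out.println("Fields that differ: " + diff(saved, restored));
    }
}
```

With the real protobuf object the same idea applies: compare the stored and restored instances field by field and log any mismatch before using the restored one to launch tasks.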

After the RM + scheduler restarts and recovers (reads state from the state
store, NMs reconnect to the RM, etc.), I run a job. The YARN side of things
works fine and the job succeeds. However, the MyriadExecutor never gets a call
to its launchTask method.

In fact, when the MyriadExecutor tries to send status updates for the running
containers, I see the following messages in the mesos-slave.INFO log:
I0821 18:11:35.606885 27735 slave.cpp:2671] Handling status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 from executor(1)@10.10.101.138:36136
W0821 18:11:35.607535 27735 slave.cpp:2715] *Could not find the executor for status update* TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006
I0821 18:11:35.608109 27757 status_update_manager.cpp:322] Received status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006
I0821 18:11:35.608927 27735 slave.cpp:2926] Forwarding the update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 to [email protected]:5050
I0821 18:11:35.609191 27735 slave.cpp:2856] Sending acknowledgement for status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006 to executor(1)@10.10.101.138:36136
I0821 18:11:35.613358 27750 status_update_manager.cpp:394] Received status update acknowledgement (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task yarn_container_1440205515169_0002_01_000003 of framework 20150821-140313-2321877514-5050-3434-0006
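
To find every task affected by this warning in a longer log, a small scanner can help. The following is a rough sketch, assuming the warning text has exactly the shape quoted above; the class name `OrphanStatusUpdates` and the method `find` are hypothetical, and the pattern tolerates both line-wrapped entries and the emphasis asterisks added in this mail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrphanStatusUpdates {

    // Matches the slave warning quoted above; "\\*?" tolerates emphasis asterisks.
    private static final Pattern WARNING = Pattern.compile(
        "\\*?Could not find the executor for status update\\*? (\\w+) "
        + "\\(UUID: ([0-9a-f-]+)\\) for task (\\S+) of framework (\\S+)");

    // Returns "<taskId> <state>" for each status update the slave could not map to an executor.
    static List<String> find(String logText) {
        // Collapse line wraps and repeated spaces so wrapped log entries still match.
        String flat = logText.replaceAll("\\s+", " ");
        List<String> hits = new ArrayList<>();
        Matcher m = WARNING.matcher(flat);
        while (m.find()) {
            hits.add(m.group(3) + " " + m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample taken from the warning in the log excerpt above, still line-wrapped.
        String sample =
            "W0821 18:11:35.607535 27735 slave.cpp:2715] Could not find the executor\n"
            + "for status update TASK_FINISHED (UUID:\n"
            + "a650ca87-8d91-4d93-81dd-a5493249c954) for task\n"
            + "yarn_container_1440205515169_0002_01_000003 of framework\n"
            + "20150821-140313-2321877514-5050-3434-0006\n";
        System.out.println(find(sample));
    }
}
```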

I believe Mesos is not creating a Mesos task for these YARN containers: I don't
see the yarn_container_<container_id> tasks in the Mesos UI after the
Scheduler + RM restart, whereas before the restart I do see them.

I am trying to understand whether there is anything I am missing that causes
the executor not to receive the launchTask call after the Scheduler + RM
restart.

Regards
Swapnil
