The Apache email server seems to have a limit on attachment size.
My message is bouncing when I try to send the logs to the list.
Will send Adam the logs directly.

Regards
Swapnil

---------- Forwarded message ----------
From: Swapnil Daingade <[email protected]>
Date: Thu, Aug 27, 2015 at 2:24 AM
Subject: Re: Not receiving calls to Executor after Scheduler restart
To: [email protected]


Hi Adam,

Thank you for offering to help with this.

Attaching logs as per our discussion during the call this morning.
Deployed a 2-node Mesos cluster:

10.10.101.138 had mesos-master, zookeeper, mesos-slave, mesos-dns, marathon
10.10.101.140 had mesos-slave

Launched the RM using Marathon with HA enabled.

After the RM starts (on 10.10.101.140), it automatically launches an NM that
uses coarse-grained scaling.
Did a flex-up with a zero-profile NM. Now there are 2 NMs (one FGS, one CGS).

Both NMs show up as active tasks.

Launched a YARN job. I can see Mesos tasks corresponding to the YARN containers
launched on the FGS node.
No placeholder Mesos tasks are seen on the CGS node, as expected.
The YARN job completes successfully.

Killed the RM. Marathon restarts another instance of the RM (on 10.10.101.138).
The RM recovers state from the state store and reconnects with the NMs.

Deleted the output directory for the previous YARN job and launched the
same job again.
This time I cannot see Mesos tasks corresponding to the YARN containers
launched on the FGS node.
The YARN job still completes successfully.

Logs attached. The WARNING and INFO logs for mesos-master and mesos-slave
seem interesting.

The node 10.10.101.138 is running the CGS NM. You can ignore the "Could not
lookup task for status update" warning messages for this NM: the NM reports
task updates for YARN containers even though it is a CGS NM and does not
launch placeholder tasks.
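
For what it's worth, that warning just means the executor called
sendStatusUpdate for a task ID the agent never launched. A rough sketch of a
guard that would keep a CGS NM quiet (class and method names are mine, not
Myriad's actual code):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.Protos;

// Hypothetical guard, not Myriad's actual code: remember which task IDs this
// executor really launched and only send Mesos status updates for those.
class StatusReportGuard {
  private final Set<Protos.TaskID> launchedTasks = ConcurrentHashMap.newKeySet();

  // call from Executor.launchTask()
  void onLaunchTask(Protos.TaskInfo task) {
    launchedTasks.add(task.getTaskId());
  }

  // call wherever YARN container state is translated into a Mesos status
  void maybeReport(ExecutorDriver driver, Protos.TaskID taskId, Protos.TaskState state) {
    if (!launchedTasks.contains(taskId)) {
      return; // CGS node: no placeholder Mesos task exists for this container
    }
    driver.sendStatusUpdate(Protos.TaskStatus.newBuilder()
        .setTaskId(taskId)
        .setState(state)
        .build());
  }
}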

However, the node 10.10.101.140 is running the FGS NM, so it should be
launching placeholder tasks.
Instead, I see log messages like:
ACCEPT call used invalid offers '[
20150826-222858-2321877514-5050-1026-O645 ]': Offer
20150826-222858-2321877514-5050-1026-O645 is no longer valid

On the scheduler side, I am getting TASK_LOST status updates for attempts to
launch placeholder tasks after the RM restart.

task_id {
  value: "yarn_container_e01_1440665526715_0001_01_000004"
}
state: TASK_LOST
message: "Task launched with invalid offers: Offer
20150827-002328-2321877514-5050-20344-O2892 is no longer valid"
slave_id {
  value: "20150827-002328-2321877514-5050-20344-S0"
}
timestamp: 1.440665650436197E9
source: SOURCE_MASTER
reason: REASON_INVALID_OFFERS
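
REASON_INVALID_OFFERS means the launch referenced offer IDs the master no
longer considers outstanding, e.g. offers received before the restart or
failover. A minimal sketch of the pattern on the scheduler side (class and
method names are mine, not Myriad's): cache offers from resourceOffers(),
drop them on offerRescinded(), and clear the whole cache on
disconnected()/reregistered(), so launchTasks() only ever uses offer IDs from
the current master session.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.Protos.OfferID;

// Hypothetical offer cache, not Myriad's actual code: offers held across a
// disconnect/re-registration carry IDs the master no longer recognizes, which
// is exactly what produces the "Offer ... is no longer valid" message above.
class OfferCache {
  private final Map<OfferID, Offer> offers = new ConcurrentHashMap<>();

  void add(Offer offer)     { offers.put(offer.getId(), offer); } // from resourceOffers()
  void remove(OfferID id)   { offers.remove(id); }                // from offerRescinded()
  void clear()              { offers.clear(); }                   // from disconnected()/reregistered()
  Iterable<Offer> current() { return offers.values(); }           // only launch against these
}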

Thank you again for your help.

Regards
Swapnil



On Fri, Aug 21, 2015 at 7:09 PM, Swapnil Daingade <
[email protected]> wrote:

> Hi All,
>
> Need help with a Mesos issue.
> I am currently working on the Myriad HA implementation.
>
> Currently we launch the NodeManager as a Mesos task. With fine-grained
> scaling, we also launch
> YARN containers as Mesos tasks. I learned from Santosh that in order to
> launch tasks on the
> same executor, the ExecutorInfo object has to be exactly the same.
> The ExecutorInfo object now contains the command line for launching the
> NodeManager (after the NM + Executor merge).
> So I save the ExecutorInfo, and this works fine when the Scheduler + RM is
> not restarted.
>
> In order to make sure that fine-grained scaling continues to work after a
> Scheduler + RM restart, I preserve and retrieve
> the ExecutorInfo object from the StateStore (the pull request for this is out
> for review). I have verified that this storage and
> retrieval works correctly.
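>
> As a concrete illustration (the helper name is mine, not actual Myriad code),
> the placeholder task for a YARN container has to embed that exact persisted
> ExecutorInfo, otherwise Mesos treats the launch as a new executor instead of
> routing launchTask to the NM's already-running executor:
>
> import org.apache.mesos.Protos.ExecutorInfo;
> import org.apache.mesos.Protos.Offer;
> import org.apache.mesos.Protos.TaskID;
> import org.apache.mesos.Protos.TaskInfo;
>
> // Rough sketch only; the placeholder's (zero) resources are omitted for brevity.
> class PlaceholderTasks {
>   static TaskInfo forContainer(String containerId, Offer offer, ExecutorInfo savedExecutorInfo) {
>     return TaskInfo.newBuilder()
>         .setName("yarn_container_" + containerId)
>         .setTaskId(TaskID.newBuilder().setValue("yarn_container_" + containerId))
>         .setSlaveId(offer.getSlaveId())
>         // must match the ExecutorInfo used to launch the NM byte for byte
>         .setExecutor(savedExecutorInfo)
>         .build();
>   }
> }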
>
> When the RM + Scheduler restarts and recovers (reads state from the state
> store, NMs reconnect to the RM, etc.),
> I run a job. The YARN side of things works fine and the job succeeds.
> However, I don't see the MyriadExecutor get calls to the launchTask method
> at all.
>
> In fact, when the MyriadExecutor tries to send status updates for running
> containers, I see the following messages in the mesos-slave.INFO log:
> I0821 18:11:35.606885 27735 slave.cpp:2671] Handling status update
> TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task
> yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006 from executor(1)@
> 10.10.101.138:36136
> W0821 18:11:35.607535 27735 slave.cpp:2715] *Could not find the executor
> for status update* TASK_FINISHED (UUID:
> a650ca87-8d91-4d93-81dd-a5493249c954) for task
> yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006
> I0821 18:11:35.608109 27757 status_update_manager.cpp:322] Received status
> update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task
> yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006
> I0821 18:11:35.608927 27735 slave.cpp:2926] Forwarding the update
> TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task
> yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006 to [email protected]:5050
> I0821 18:11:35.609191 27735 slave.cpp:2856] Sending acknowledgement for
> status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954)
> for task yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006 to executor(1)@
> 10.10.101.138:36136
> I0821 18:11:35.613358 27750 status_update_manager.cpp:394] Received status
> update acknowledgement (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for
> task yarn_container_1440205515169_0002_01_000003 of framework
> 20150821-140313-2321877514-5050-3434-0006
>
> I believe Mesos is not creating a Mesos task for these YARN containers
> (I don't see the yarn_container_<container_id> tasks in the Mesos UI
> after the Scheduler + RM restart; before the restart I do see these tasks).
>
> I am trying to understand if there is anything I may be missing that's
> causing the executor to not receive the launchTask call after a Scheduler +
> RM restart.
>
> Regards
> Swapnil
>
>
