Hi All,

I was able to figure out the root cause of the problem.

Looks like we have a bug in the scheduler. We seem to decline Mesos offers
if we cannot launch an NM task against them. We also launch only one NM per
node. Thus, if we have NMs running on all nodes and we do a flex up, the new
NM task remains pending. Since we cannot launch the new NM task against
incoming offers, we keep declining them. When we do fine-grained scaling, we
then launch tasks against these already-declined offers. This results in
messages like the one below in the mesos-master logs:

ACCEPT call used invalid offers '[ 20150826-222858-2321877514-5050-1026-O645 ]':
Offer 20150826-222858-2321877514-5050-1026-O645 is no longer valid.

I am adding a fix for this and will update the pull request (a rough sketch
of the idea is at the bottom of this mail, below the quoted thread).

Regards
Swapnil

On Thu, Aug 27, 2015 at 2:39 AM, Swapnil Daingade <[email protected]> wrote:

> The Apache email server seems to have some limit on the size of
> attachments. My message is bouncing when I try to send the logs to the
> list. Will send Adam the logs directly.
>
> Regards
> Swapnil
>
> ---------- Forwarded message ----------
> From: Swapnil Daingade <[email protected]>
> Date: Thu, Aug 27, 2015 at 2:24 AM
> Subject: Re: Not receiving calls to Executor after Scheduler restart
> To: [email protected]
>
> Hi Adam,
>
> Thank you for offering to help with this.
>
> Attaching logs as per our discussion during the call this morning.
> I deployed a 2-node Mesos cluster:
>
> 10.10.101.138 had mesos-master, zookeeper, mesos-slave, mesos-dns, marathon
> 10.10.101.140 had mesos-slave
>
> Launched the RM using Marathon with HA enabled.
>
> After the RM starts (on 10.10.101.140), it automatically launches an NM
> that uses coarse-grained scaling. Did a flex up with a zero-profile NM.
> Now there are 2 NMs (one FGS, one CGS).
>
> Both NMs show up as active tasks.
>
> Launched a YARN job. I can see Mesos tasks corresponding to YARN
> containers launched on the FGS node. No placeholder Mesos tasks are seen
> on the CGS node, as expected. The YARN job completes successfully.
>
> Killed the RM. Marathon restarts another instance of the RM (on
> 10.10.101.138). The RM recovers state from the state store and reconnects
> with the NMs.
>
> Deleted the output directory for the previous YARN job and launched the
> same job again. This time I cannot see Mesos tasks corresponding to YARN
> containers launched on the FGS node. The YARN job completes successfully.
>
> Logs attached. The WARNING and INFO logs for mesos-master and mesos-slave
> seem interesting.
>
> The node 10.10.101.138 is running the CGS NM. You can ignore the "Could
> not lookup task for status update" warning messages for this NM. The NM
> reports task updates for YARN containers even though it is a CGS NM and
> does not launch placeholder tasks.
>
> However, the node 10.10.101.140 is running the FGS NM. It should be
> launching placeholder tasks. Instead, I see log messages like:
>
> ACCEPT call used invalid offers '[ 20150826-222858-2321877514-5050-1026-O645 ]':
> Offer 20150826-222858-2321877514-5050-1026-O645 is no longer valid
>
> On the scheduler side, I am getting TASK_LOST status updates for attempts
> to launch placeholder tasks after the RM restart:
>
> task_id {
>   value: "yarn_container_e01_1440665526715_0001_01_000004"
> }
> state: TASK_LOST
> message: "Task launched with invalid offers: Offer
> 20150827-002328-2321877514-5050-20344-O2892 is no longer valid"
> slave_id {
>   value: "20150827-002328-2321877514-5050-20344-S0"
> }
> timestamp: 1.440665650436197E9
> source: SOURCE_MASTER
> reason: REASON_INVALID_OFFERS
>
> Thank you again for your help.
>
> Regards
> Swapnil
>
> On Fri, Aug 21, 2015 at 7:09 PM, Swapnil Daingade <[email protected]> wrote:
>
>> Hi All,
>>
>> Need help with a Mesos issue. I am currently working on the Myriad HA
>> implementation.
>>
>> Currently we launch the NodeManager as a Mesos task. With fine-grained
>> scaling, we also launch YARN containers as Mesos tasks. I learned from
>> Santosh that in order to launch tasks on the same executor, the
>> ExecutorInfo object has to be exactly the same. The ExecutorInfo object
>> now contains the command line for launching the NodeManager (after the
>> NM + Executor merge), so I save the ExecutorInfo, and this works fine
>> when the Scheduler + RM is not restarted.
>>
>> In order to make sure that fine-grained scaling continues to work after a
>> Scheduler + RM restart, I preserve and retrieve the ExecutorInfo object
>> from the StateStore (the pull request for this is out for review). I have
>> verified that this storage and retrieval is working correctly.
>>
>> When the RM + Scheduler restarts and recovers (reads state from the state
>> store, NMs reconnect to the RM, etc.), I run a job. The YARN side of
>> things works fine and the job succeeds. However, I don't see the
>> MyriadExecutor get calls to the launchTask method at all.
>>
>> In fact, when the MyriadExecutor tries to send status for running
>> containers, I see the following messages in the mesos-slave.INFO log:
>>
>> I0821 18:11:35.606885 27735 slave.cpp:2671] Handling status update
>> TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task
>> yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006 from executor(1)@10.10.101.138:36136
>> W0821 18:11:35.607535 27735 slave.cpp:2715] *Could not find the executor
>> for status update* TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954)
>> for task yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006
>> I0821 18:11:35.608109 27757 status_update_manager.cpp:322] Received
>> status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954)
>> for task yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006
>> I0821 18:11:35.608927 27735 slave.cpp:2926] Forwarding the update
>> TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954) for task
>> yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006 to [email protected]:5050
>> I0821 18:11:35.609191 27735 slave.cpp:2856] Sending acknowledgement for
>> status update TASK_FINISHED (UUID: a650ca87-8d91-4d93-81dd-a5493249c954)
>> for task yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006 to executor(1)@10.10.101.138:36136
>> I0821 18:11:35.613358 27750 status_update_manager.cpp:394] Received
>> status update acknowledgement (UUID: a650ca87-8d91-4d93-81dd-a5493249c954)
>> for task yarn_container_1440205515169_0002_01_000003 of framework
>> 20150821-140313-2321877514-5050-3434-0006
>>
>> I believe Mesos is not creating a Mesos task for these YARN containers
>> (I don't see the yarn_container_<container_id> tasks in the Mesos UI
>> after the Scheduler + RM restart; before the restart I do see these tasks).
>>
>> I am trying to understand if there is anything I may be missing that's
>> causing the executor to not receive the launchTask call after a
>> Scheduler + RM restart.
>>
>> Regards
>> Swapnil
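
For anyone who wants to dig in before the pull request is updated, here is a
minimal sketch of one way to avoid the problem, assuming the Mesos Java
bindings: hold on to offers that the NM task cannot use (instead of declining
them immediately), drop them when the master rescinds them, and let the
fine-grained scaling path launch placeholder tasks only against offers we
still hold. The class and helper names below (OfferTracker,
canLaunchPendingNM, buildNMTask, launchPlaceholders) are made up for
illustration; the actual fix may take a different shape.

// Sketch only: class and helper names are hypothetical, not the actual
// Myriad code; the real fix will be in the pull request.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.mesos.Protos.Offer;
import org.apache.mesos.Protos.OfferID;
import org.apache.mesos.Protos.TaskInfo;
import org.apache.mesos.SchedulerDriver;

public class OfferTracker {

  // Offers that have been neither consumed nor declined/rescinded yet.
  private final Map<OfferID, Offer> liveOffers = new ConcurrentHashMap<>();

  // Called from Scheduler.resourceOffers().
  public void onOffers(SchedulerDriver driver, List<Offer> offers) {
    for (Offer offer : offers) {
      if (canLaunchPendingNM(offer)) {
        // A pending NM task fits this offer: launch it and consume the offer.
        driver.launchTasks(Collections.singletonList(offer.getId()),
                           Collections.singletonList(buildNMTask(offer)));
      } else {
        // Do not decline right away. Hold the offer so the fine-grained
        // scaling path launches placeholder tasks only against offers the
        // master still considers valid.
        liveOffers.put(offer.getId(), offer);
      }
    }
  }

  // Called from Scheduler.offerRescinded() so a rescinded offer is never reused.
  public void onOfferRescinded(OfferID offerId) {
    liveOffers.remove(offerId);
  }

  // Fine-grained scaling path: launch placeholders only against an offer we
  // still hold, and remove it once used.
  public void launchPlaceholders(SchedulerDriver driver, Offer offer,
                                 List<TaskInfo> placeholders) {
    if (liveOffers.remove(offer.getId()) != null) {
      driver.launchTasks(Collections.singletonList(offer.getId()), placeholders);
    }
  }

  // Periodically decline offers we held but never used, so their resources
  // go back to the Mesos allocator for other frameworks.
  public void releaseUnused(SchedulerDriver driver) {
    for (OfferID id : new ArrayList<>(liveOffers.keySet())) {
      liveOffers.remove(id);
      driver.declineOffer(id);
    }
  }

  // Hypothetical helpers, stubbed out for the sketch.
  private boolean canLaunchPendingNM(Offer offer) { return false; }
  private TaskInfo buildNMTask(Offer offer) { return null; }
}

The important design point is that held offers are eventually declined via
releaseUnused(), so the scheduler does not hoard resources it never uses.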

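For completeness, here is also a minimal sketch of the ExecutorInfo
persistence described in the Aug 21 mail quoted above, assuming a simple
key/value style state store. The StateStore interface and class names below
are hypothetical, not the actual Myriad state store API (the real change is
in the pull request that is out for review). The point is that the protobuf
is stored and re-parsed byte-for-byte, so the ExecutorInfo attached to
placeholder tasks after a Scheduler + RM restart matches exactly the one the
running executor was launched with.

// Sketch only: StateStore and class names are hypothetical, not the actual
// Myriad state store API.
import com.google.protobuf.InvalidProtocolBufferException;

import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.TaskInfo;

public class ExecutorInfoStore {

  /** Hypothetical key/value state store (e.g. backed by ZooKeeper). */
  public interface StateStore {
    void put(String key, byte[] value);
    byte[] get(String key);
  }

  private static final String KEY_PREFIX = "nm-executor-info/";

  private final StateStore store;

  public ExecutorInfoStore(StateStore store) {
    this.store = store;
  }

  // Save the ExecutorInfo used to launch the NM on a node. Storing the raw
  // protobuf bytes keeps the recovered object identical to the original.
  public void save(String nodeHost, ExecutorInfo executorInfo) {
    store.put(KEY_PREFIX + nodeHost, executorInfo.toByteArray());
  }

  // Recover the exact ExecutorInfo after a Scheduler + RM restart.
  public ExecutorInfo load(String nodeHost) throws InvalidProtocolBufferException {
    byte[] bytes = store.get(KEY_PREFIX + nodeHost);
    return bytes == null ? null : ExecutorInfo.parseFrom(bytes);
  }

  // Placeholder tasks launched after the restart must carry the recovered
  // ExecutorInfo unchanged, since tasks only land on the same running
  // executor when the ExecutorInfo matches exactly.
  public TaskInfo withRecoveredExecutor(TaskInfo.Builder placeholder,
                                        ExecutorInfo recovered) {
    return placeholder.setExecutor(recovered).build();
  }
}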