Just make sure you only send one LaunchTasksMessage per slave, although
that message could contain multiple tasks launched on a collection of
offers from the same slave.
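
A minimal sketch of that grouping, in Python, with plain dicts standing in
for the real Mesos protobuf types. The field names (`offer_id`, `slave_id`)
and the helper `group_launches_by_slave` are hypothetical, made up for
illustration; this is not the Mesos or Chronos API, only the bookkeeping a
scheduler might do before calling the driver once per slave.

```python
from collections import defaultdict


def group_launches_by_slave(offers, assignments):
    """Build one launch request per slave (hypothetical helper).

    offers:      list of dicts like {"offer_id": "o1", "slave_id": "s1"}
    assignments: dict mapping offer_id -> list of task names placed on it
    Returns a dict: slave_id -> {"offer_ids": [...], "tasks": [...]}.
    """
    launches = defaultdict(lambda: {"offer_ids": [], "tasks": []})
    for offer in offers:
        tasks = assignments.get(offer["offer_id"], [])
        if not tasks:
            # Nothing was placed on this offer; decline or hold it separately.
            continue
        launch = launches[offer["slave_id"]]
        launch["offer_ids"].append(offer["offer_id"])
        launch["tasks"].extend(tasks)
    return dict(launches)


# Illustrative data: two offers from slave s1, one from slave s2.
offers = [
    {"offer_id": "o1", "slave_id": "s1"},
    {"offer_id": "o2", "slave_id": "s1"},
    {"offer_id": "o3", "slave_id": "s2"},
]
assignments = {"o1": ["task-a"], "o2": ["task-b", "task-c"], "o3": ["task-d"]}

launches = group_launches_by_slave(offers, assignments)
for slave_id, launch in sorted(launches.items()):
    # Each iteration corresponds to a single launch message for one slave.
    print(slave_id, launch["offer_ids"], launch["tasks"])
```

Each entry in `launches` would then map to exactly one launch call, carrying
every offer ID and every task for that slave.
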
You mention that launching 1000s of tasks in the same message causes Mesos to
crash. Do you have a stack trace from the crash?
You shouldn't have to respond to all offers received before tasks get
launched. Some frameworks "hoard" offers in case they want to launch
something on them later, but launch other tasks in the meantime. Perhaps
the delay has something to do with Chronos' cron-like scheduling feature?
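
To illustrate what "hoarding" could look like on the framework side, here is
a rough Python sketch. `OfferPool` and its methods are hypothetical names
invented for this example, and offers are again plain dicts rather than Mesos
types; it only shows the idea of answering some offers now while keeping
others for later.

```python
class OfferPool:
    """Holds offers the framework has not answered yet (hypothetical helper)."""

    def __init__(self):
        self._held = {}  # offer_id -> offer dict

    def add(self, offers):
        """Remember newly received offers without replying to them."""
        for offer in offers:
            self._held[offer["offer_id"]] = offer

    def take(self, offer_id):
        """Remove and return an offer so it can be used in a launch."""
        return self._held.pop(offer_id)

    def held_ids(self):
        """IDs of offers still being held, in sorted order."""
        return sorted(self._held)


pool = OfferPool()
pool.add([
    {"offer_id": "o1", "slave_id": "s1"},
    {"offer_id": "o2", "slave_id": "s2"},
])

used = pool.take("o1")    # launch tasks on o1 right away
print(used["slave_id"])   # the slave those tasks would go to
print(pool.held_ids())    # o2 is still held for a later launch
```
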

On Tue, Feb 3, 2015 at 5:46 AM, Chengwei Yang <chengwei.yang...@gmail.com>
wrote:

> Hi List,
>
> We are running Chronos on Mesos 0.19.0 and found an interesting problem:
> if we try to launch about 1k tasks in a single resourceOffers(), it may
> crash, and no tasks are started by Mesos at all.
>
> So we ran a test. We changed the Chronos resourceOffers() callback to do
> the following:
>
> 1. print a log message
> 2. decline the first offer in the batch of offers
> 3. sleep 30 seconds
> 4. decline all the offers received
>
> We also added a log statement in src/master/master.cpp that prints whenever
> a LaunchTasksMessage is received; see the log below.
>
> -----------8<-----------------------
> I0203 18:32:33.169342  7680 master.cpp:2939] Sending 3 offers to framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:32:39.523227  7670 http.cpp:452] HTTP request for
> '/master/state.json'
> I0203 18:32:49.601284  7674 http.cpp:452] HTTP request for
> '/master/state.json'
> I0203 18:32:59.677875  7677 http.cpp:452] HTTP request for
> '/master/state.json'
> I0203 18:33:03.390188  7676 master.cpp:1754] Received launchTasks message
> for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.390949  7676 master.cpp:1895] Processing reply for offers:
> [ 20150203-183014-2487817994-5050-7668-0 ] on slave
> 20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051
> (xulijian-mesos-online016-cqdx.qiyi.virtual) for framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.391469  7676 master.cpp:1754] Received launchTasks message
> for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.391791  7670 hierarchical_allocator_process.hpp:589]
> Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
> 20150203-183014-2487817994-5050-7668-2 for 5secs
> W0203 18:33:03.392019  7676 master.cpp:1871] Failed to validate offer
> 20150203-183014-2487817994-5050-7668-0: Offer
> 20150203-183014-2487817994-5050-7668-0 is no longer valid
> I0203 18:33:03.393173  7676 master.cpp:1754] Received launchTasks message
> for offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.393601  7676 master.cpp:1895] Processing reply for offers:
> [ 20150203-183014-2487817994-5050-7668-1 ] on slave
> 20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051
> (xulijian-mesos-online017-cqdx.qiyi.virtual) for framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.394057  7676 master.cpp:1754] Received launchTasks message
> for offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.394379  7679 hierarchical_allocator_process.hpp:589]
> Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
> 20150203-183014-2487817994-5050-7668-1 for 5secs
> I0203 18:33:03.394664  7676 master.cpp:1895] Processing reply for offers:
> [ 20150203-183014-2487817994-5050-7668-2 ] on slave
> 20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051
> (xulijian-mesos-online015-cqdx.qiyi.virtual) for framework
> 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.395504  7676 hierarchical_allocator_process.hpp:589]
> Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
> 20150203-183014-2487817994-5050-7668-0 for 5secs
> ---------------8<-------------------
>
> As we can see, mesos-master sent the offers to Chronos at 18:32:33, but
> received all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are
> very curious: why wasn't the first decline sent before the 30-second sleep?
>
> From the log, we see that offer 0 is no longer valid because we had already
> sent a decline for it.
>
> Does that mean we (the framework scheduler) have to reply to all offers
> received before we can launch any task?
>
> --
> Thanks,
> Chengwei
>
