On Tue, Feb 03, 2015 at 07:28:37AM -0800, Adam Bordelon wrote:
> Just make sure you only send one LaunchTasksMessage per slave, although that
> message could contain multiple tasks launched on a collection of offers from
> the same slave.

Yes, generally we use the *deprecated* launchTasks with a single offer.

> You mention that launching 1000s in the same message causes Mesos to crash. Do
> you have a crash stack available for this?

See here:

https://issues.apache.org/jira/browse/MESOS-1804
https://issues.apache.org/jira/browse/MESOS-1795

> You shouldn't have to respond to all offers received before tasks get
> launched.

Thanks!

> Some frameworks "hoard" offers in case they want to launch something on them
> later, but launch other tasks in the meantime. Perhaps the delay has something
> to do with Chronos' cron-like scheduling feature?

I'll confirm this and keep you updated.

--
Thanks,
Chengwei

> On Tue, Feb 3, 2015 at 5:46 AM, Chengwei Yang <chengwei.yang...@gmail.com>
> wrote:
>
> > Hi List,
> >
> > We are running Chronos on Mesos 0.19.0 and found an interesting problem:
> > if we try to launch about 1k tasks in a single resourceOffers(), it may
> > crash, and no tasks are started by Mesos at all.
> >
> > So we ran a test. We changed the code in the Chronos resourceOffers()
> > callback to:
> >
> > 1. print a log message
> > 2. decline the first offer in the bunch of offers
> > 3. sleep 30 seconds
> > 4. decline all the offers received
> >
> > We also added a log statement in src/master/master.cpp that prints
> > whenever a LaunchTasksMessage is received; see the log below.
> >
> > -----------8<-----------------------
> > I0203 18:32:33.169342 7680 master.cpp:2939] Sending 3 offers to framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:32:39.523227 7670 http.cpp:452] HTTP request for '/master/state.json'
> > I0203 18:32:49.601284 7674 http.cpp:452] HTTP request for '/master/state.json'
> > I0203 18:32:59.677875 7677 http.cpp:452] HTTP request for '/master/state.json'
> > I0203 18:33:03.390188 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.390949 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-0 ] on slave 20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051 (xulijian-mesos-online016-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.391469 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.391791 7670 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-2 for 5secs
> > W0203 18:33:03.392019 7676 master.cpp:1871] Failed to validate offer 20150203-183014-2487817994-5050-7668-0: Offer 20150203-183014-2487817994-5050-7668-0 is no longer valid
> > I0203 18:33:03.393173 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.393601 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-1 ] on slave 20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051 (xulijian-mesos-online017-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.394057 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.394379 7679 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-1 for 5secs
> > I0203 18:33:03.394664 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-2 ] on slave 20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051 (xulijian-mesos-online015-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> > I0203 18:33:03.395504 7676 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-0 for 5secs
> > ---------------8<-------------------
> >
> > As we can see, mesos-master sent the offers to Chronos at 18:32:33 but
> > received all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are
> > very curious why the first decline isn't sent before the 30-second sleep.
> >
> > From the log, we see that offer 0 is no longer valid because we had
> > already sent a decline for it.
> >
> > Does that mean we (the framework scheduler) have to reply to all offers
> > received before we can launch any task?
> >
> > --
> > Thanks,
> > Chengwei
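
The gap described in the quoted message can be checked directly from the two log timestamps. The following is only a minimal sketch: the helper name `glog_time` is hypothetical, it assumes the time of day is the second whitespace-separated field of each log line (as in the excerpt above), and the second log line is abbreviated.

```python
from datetime import datetime

def glog_time(line):
    """Parse the time-of-day field (second token) of a log line like the ones above."""
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")

# First and fifth entries from the quoted log (the second one is abbreviated here).
sent = ("I0203 18:32:33.169342 7680 master.cpp:2939] Sending 3 offers to "
        "framework 20150203-174243-2487817994-5050-10996-0000")
received = "I0203 18:33:03.390188 7676 master.cpp:1754] Received launchTasks message"

# Both entries are on the same day, so subtracting the times of day is enough.
delay = (glog_time(received) - glog_time(sent)).total_seconds()
print(f"{delay:.3f} s")
```

The result is just over 30 seconds, which matches the sleep in step 3 of the test: the master saw the first decline only after the sleep had finished.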
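
Adam's advice quoted at the top of this message (one LaunchTasksMessage per slave, possibly covering several offers from that slave) amounts to grouping offers by slave before launching. Below is a minimal sketch of that grouping step only. It assumes offers are plain dicts with hypothetical `offer_id` and `slave_id` keys instead of real Mesos protobuf objects, and it does not call the Mesos API.

```python
from collections import defaultdict

def group_offers_by_slave(offers):
    """Return {slave_id: [offer_id, ...]} so each slave gets a single launch call."""
    grouped = defaultdict(list)
    for offer in offers:
        grouped[offer["slave_id"]].append(offer["offer_id"])
    return dict(grouped)

# Hypothetical offers: two from slave "s-2", one from slave "s-1".
offers = [
    {"offer_id": "o-0", "slave_id": "s-2"},
    {"offer_id": "o-1", "slave_id": "s-1"},
    {"offer_id": "o-2", "slave_id": "s-2"},
]

batches = group_offers_by_slave(offers)
for slave_id, offer_ids in batches.items():
    # A real scheduler would issue one launch per slave here, passing all of
    # that slave's offer IDs together with the tasks to run on them.
    print(slave_id, offer_ids)
```

In a real framework each batch would feed one launch call for that slave; how the tasks are packed into the combined resources is left out of this sketch.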