Ah yes, this is due to a feature called "framework failover" in that version of Mesos, which has an overly large timeout by default. Basically, the idea is that when a framework disconnects from the master, Mesos gives it some time to reconnect before killing its executors and tasks, and this timeout defaults to 1 day. You can fix it by passing --failover_timeout=1 when running mesos-master. If you're running through the deploy scripts, add failover_timeout=1 to your mesos.conf.
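For example (a sketch; the exact invocation depends on how you launch the master):

  mesos-master --failover_timeout=1

or, if you're using the deploy scripts, in mesos.conf:

  failover_timeout=1

With this, the master waits only briefly for a disconnected framework to reconnect before cleaning up its executors and tasks, so the memory you're seeing held is released almost immediately after the Spark job exits.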
I'll update the Spark wiki to mention this because it's come up a bunch. It will not be an issue in Mesos 0.9.

Matei

On Apr 20, 2012, at 10:39 AM, Scott Smith wrote:

> I'm running Spark git head / Mesos 1205738. My cluster is small -- a
> single slave with 2 CPUs and 1.2GB of available RAM.
>
> I can run SparkPi once, given:
>
>   ./run spark.examples.SparkPi master@...
>
> but I can't run it twice. It seems that each invocation of SparkPi
> creates a new framework entry in the webui:
>
>   201204200627-0-0022  ubuntu  SparkPi  0  0  800.0 MB  0.68  2012-04-20 17:24:47
>
> even after waiting for a couple of minutes, the memory is still reserved.
>
> I'm not sure what is supposed to release the resource -- the program
> has exited, so the framework shouldn't exist anymore. I added
> 'spark.stop()' to the end of the program but that doesn't help. The
> only way I've found to clean up the slave is to kill and restart it.
> Doing this, however, still leaves stale empty framework entries in the
> master:
>
>   201204200627-0-0018  ubuntu  SparkPi  0  0  0.0 MB  0.00  2012-04-20 17:09:28
>   201204200627-0-0019  ubuntu  SparkPi  0  0  0.0 MB  0.00  2012-04-20 17:17:25
>   201204200627-0-0016  ubuntu  SparkPi  0  0  0.0 MB  0.00  2012-04-20 16:50:35
>   201204200627-0-0017  ubuntu  SparkPi  0  0  0.0 MB  0.00  2012-04-20 16:51:19
>   .....
>
> I'm also not sure if instead the correct behavior is that subsequent
> invocations of SparkPi should reuse the existing framework -- if so,
> how do I make that happen?
>
> Thanks!
> --
> Scott
