Jay Buffington created AURORA-589:
-------------------------------------
Summary: using {{thermos.ports[foo_{{mesos.instance}}]}} creates
task that can never start
Key: AURORA-589
URL: https://issues.apache.org/jira/browse/AURORA-589
Project: Aurora
Issue Type: Bug
Reporter: Jay Buffington
Priority: Critical
Today a user wrote a .aurora file that had this command line:
{noformat}
'echo {{thermos.ports[foo_{{mesos.instance}}]}}; sleep 300',
{noformat}
It caused thermos_runner to fail to start. __main__.log had this content:
{noformat}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:44.033905 25696 fetcher.cpp:76] Fetching URI
'/usr/local/bin/thermos_executor'
I0717 22:12:44.034059 25696 fetcher.cpp:179] Copying resource from
'/usr/local/bin/thermos_executor' to
'/tmp/mesos/slaves/20140623-183547-1749004561-5050-30136-42/frameworks/20140522-213145-1749004561-5050-29512-0000/executors/thermos-1405635163935-jaybuff-test-thermos_bug-0-386f11a7-3998-460b-ade7-7293cc75a860/runs/f72554fd-e6a0-46c7-8d09-989fd0b4c399'
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in .
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:45.586942 25695 exec.cpp:131] Version: 0.18.0
I0717 22:12:45.590029 25716 exec.cpp:205] Executor registered on slave
20140623-183547-1749004561-5050-30136-42
Writing log files to disk in .
thermos_runner.pex: error: ERROR! Unbound ports: foo_0
ERROR] Task did not start with in deadline, forcing loss.
FATAL] Task initialization failed: Task did not start within deadline.
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for twitter.common.log (Logging
subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug: Active thread: <_MainThread(MainThread, started
139823719085824)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-2,
started daemon 139823370553088)>
twitter.common.app debug: Exiting cleanly.
{noformat}
The Aurora scheduler web UI shows the task failing over and over, with each
completed task showing this state transition:
{noformat}
07/17 5:29:45 local - PENDING - Rescheduled
07/17 5:29:45 local - ASSIGNED
07/17 5:29:46 local - STARTING - Initializing sandbox.
07/17 5:30:46 local - FAILED - Task initialization failed: Task did not start
within deadline
{noformat}
I'm still discussing with my user whether he has a valid use case for this
(amazingly, I think he might). Either way, the client knows that there is a
port in the cmdline and it's not setting requestedPorts, so the client should
fail before sending the request to the scheduler, because the runner will never
be able to start.
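As a rough illustration of the kind of check I mean, here is a minimal sketch in plain Python. It is not the Aurora client's actual code: the function name check_port_refs and both regexes are hypothetical, and it assumes a port name is valid only when it is a literal identifier with no nested {{...}} template inside it.

```python
import re

# Hypothetical patterns, not taken from the Aurora client.
# A "plain" reference has a literal identifier as its port name.
PLAIN_PORT_REF = re.compile(r'\{\{thermos\.ports\[(\w+)\]\}\}')
# Any opening of a port reference, whether or not the name is plain.
ANY_PORT_REF = re.compile(r'\{\{thermos\.ports\[')


def check_port_refs(cmdline):
    """Return the set of literal port names referenced in cmdline.

    Raises ValueError if any port reference has a name that is not a
    literal identifier (for example, one containing a nested template).
    """
    total = len(ANY_PORT_REF.findall(cmdline))
    names = PLAIN_PORT_REF.findall(cmdline)
    if len(names) != total:
        raise ValueError('port reference with a non-literal name in: %r' % cmdline)
    return set(names)
```

With a check like this, the cmdline from this report would be rejected on the client side, while a cmdline such as 'echo {{thermos.ports[http]}}' would yield the port name http.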
I've told him to work around it by writing a Python loop that iterates over all
the instances and prints out "echo {{thermos.ports[foo_%d]}}" for each one.
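One possible reading of that workaround, as a sketch: build the cmdline with ordinary Python string formatting so that every port name is a literal by the time the config is evaluated. The helper name port_echo_cmdline and the instance count of 3 are made up for illustration, and joining the per-instance echoes into a single cmdline is my assumption about how the result would be used.

```python
def port_echo_cmdline(num_instances):
    """Build a cmdline that echoes one literal port reference per instance."""
    # %-formatting leaves the double braces untouched, so each reference
    # reaches Thermos with a literal port name such as foo_0.
    echoes = ['echo {{thermos.ports[foo_%d]}}' % i for i in range(num_instances)]
    return '; '.join(echoes + ['sleep 300'])


# Hypothetical instance count, for illustration only.
cmdline = port_echo_cmdline(3)
print(cmdline)
```

Since each name is then a literal, the client should be able to see the ports in the cmdline, unlike with the nested {{mesos.instance}} form.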