[
https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783979#comment-15783979
]
Christopher Hunt commented on MESOS-6252:
-----------------------------------------
I believe it is reasonable for command line arguments to change between
executions. Arguments can pertain to the state of the scheduler at the time a
new executor is required. In our scenario, we wish to communicate a "seed node"
to the executor so that it can connect. This seed node is communicated as an
IP address, which can indeed change between invocations. Note that once
communication with the seed node is established in our world, the seed nodes
are gossiped and this argument becomes largely irrelevant. However, should
a new executor need to be established for some reason, e.g. because the old
ones have somehow died, the seed node argument once again becomes relevant.
Perhaps as a compromise, Mesos should compare only the first argument of a
command, i.e. the path to the command itself, thereby accepting that a
command's arguments are variable.
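To make the proposed compromise concrete, here is a minimal sketch in Python. It is not Mesos code: `ExecutorSpec` and `is_compatible` are hypothetical names, and the command is modelled as a plain argument list, which is a simplification of the real {{ExecutorInfo}}. The idea is only that two executor descriptions count as compatible when the executor id and the command path match, whatever the remaining arguments are.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExecutorSpec:
    """Hypothetical, simplified stand-in for an ExecutorInfo."""
    executor_id: str
    argv: List[str]  # argv[0] is taken to be the path to the command


def is_compatible(existing: ExecutorSpec, new: ExecutorSpec) -> bool:
    """Compare the executor id and the command path only; ignore arguments."""
    if existing.executor_id != new.executor_id:
        return False
    if not existing.argv or not new.argv:
        # No command path to compare, so fall back to exact equality.
        return existing.argv == new.argv
    return existing.argv[0] == new.argv[0]


# Same executor id and command path; only the --core-node value differs.
existing = ExecutorSpec(
    "conductr-node-10.0.0.249-executor",
    ["./conductr-agent-*/bin/conductr-agent", "--core-node", "10.0.0.246:9004"],
)
new = ExecutorSpec(
    "conductr-node-10.0.0.249-executor",
    ["./conductr-agent-*/bin/conductr-agent", "--core-node", "10.0.0.248:9004"],
)
accepted = is_compatible(existing, new)
```

Note that the start command quoted in this issue is a single shell string chained with `&&`, whose first token is a variable assignment rather than the agent binary, so a real check would first have to decide what counts as "the path to the command".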
> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
> Key: MESOS-6252
> URL: https://issues.apache.org/jira/browse/MESOS-6252
> Project: Mesos
> Issue Type: Bug
> Components: general
> Affects Versions: 0.28.1
> Environment: coreos
> Reporter: Markus Jura
>
> When a framework reconnects to an existing executor, Mesos checks
> whether the new start command of the {{ExecutorInfo}} equals the old start
> command.
> In the case of the ConductR framework, these start commands can differ due
> to a different value of the ConductR agent argument {{--core-node}}.
> As a result, the Mesos master sends a {{TASK_ERROR}} for each running task
> to the framework. The reason for the error is {{REASON_TASK_INVALID}}.
> {code}
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
> MesosSchedulerClient
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
> akkaTimestamp=11:34:48.713UTC,
> akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
> sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state
> TASK_ERROR received by the scheduler: task_id {
> value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
> }
> state: TASK_ERROR
> message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same
> ExecutorID is not
> compatible).\n------------------------------------------------------------\nExisting
> ExecutorInfo:\nexecutor_id {\n value:
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
> type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources
> {\n name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n
> role: \"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n
> value: 1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type:
> RANGES\n ranges {\n range {\n begin: 2552\n end: 2552\n }\n
> range {\n begin: 10000\n end: 10999\n }\n }\n role:
> \"*\"\n}\ncommand {\n uris {\n value:
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
> executable: false\n extract: true\n cache: false\n }\n uris {\n
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
> executable: false\n extract: true\n cache: false\n }\n value:
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*)
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
> -Dconductr-agent.run.allocated-ports.start=10000
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
> \"conductr\"\n\n------------------------------------------------------------\nTask\'s
> ExecutorInfo:\nexecutor_id {\n value:
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
> type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources
> {\n name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n
> role: \"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n
> value: 1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type:
> RANGES\n ranges {\n range {\n begin: 2552\n end: 2552\n }\n
> range {\n begin: 10000\n end: 10999\n }\n }\n role:
> \"*\"\n}\ncommand {\n uris {\n value:
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
> executable: false\n extract: true\n cache: false\n }\n uris {\n
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
> executable: false\n extract: true\n cache: false\n }\n value:
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*)
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
> -Dconductr-agent.run.allocated-ports.start=10000
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
> \"conductr\"\n\n------------------------------------------------------------\n"
> slave_id {
> value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
> }
> timestamp: 1.474889688506464E9
> source: SOURCE_MASTER
> reason: REASON_TASK_INVALID
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
> MesosSchedulerClient
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
> akkaTimestamp=11:34:48.714UTC,
> akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
> sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state
> TASK_ERROR received by the scheduler: task_id {
> value: "40034b01-e853-4ada-882f-9aaab67f77c2"
> }
> {code}
> Mesos should only validate the executor id. If the new id of the
> {{ExecutorInfo}} object equals the old one then it should allow the
> reconnection to the running executor.