[
https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph Wu updated MESOS-6252:
-----------------------------
Description:
When a framework reconnects to an existing executor, Mesos checks whether
the new start command of the {{ExecutorInfo}} equals the old start command.
In the case of the ConductR framework, these start commands can differ because
the ConductR agent argument {{--core-node}} has a different value (in the log
below, {{10.0.0.246:9004}} in the existing {{ExecutorInfo}} versus
{{10.0.0.248:9004}} in the task's {{ExecutorInfo}}).
As a result, the Mesos master sends a {{TASK_ERROR}} for each running task to
the framework. The reason for the error is {{REASON_TASK_INVALID}}.
{code}
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
akkaTimestamp=11:34:48.713UTC,
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR
received by the scheduler: task_id {
value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
}
state: TASK_ERROR
message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same
ExecutorID is not
compatible).\n------------------------------------------------------------\nExisting
ExecutorInfo:\nexecutor_id {\n value:
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n
name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role:
\"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value:
1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n
ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n
begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n
uris {\n value:
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
executable: false\n extract: true\n cache: false\n }\n uris {\n
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
executable: false\n extract: true\n cache: false\n }\n value:
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) &&
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
-Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\nTask\'s
ExecutorInfo:\nexecutor_id {\n value:
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n
name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role:
\"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value:
1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n
ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n
begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n
uris {\n value:
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
executable: false\n extract: true\n cache: false\n }\n uris {\n
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
executable: false\n extract: true\n cache: false\n }\n value:
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) &&
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
-Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\n"
slave_id {
value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
}
timestamp: 1.474889688506464E9
source: SOURCE_MASTER
reason: REASON_TASK_INVALID
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
akkaTimestamp=11:34:48.714UTC,
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR
received by the scheduler: task_id {
value: "40034b01-e853-4ada-882f-9aaab67f77c2"
}
{code}
Mesos should validate only the executor ID. If the ID of the new
{{ExecutorInfo}} object equals the old one, Mesos should allow the
reconnection to the running executor.
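As a rough illustration of the difference between the two checks, here is a minimal Python sketch. The {{ExecutorInfo}} class is a hypothetical, simplified stand-in for the Mesos message, both function names are invented for illustration, and the commands are abbreviated from the log above. It models the behavior as described in this report and is not the actual Mesos validation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    """Hypothetical, simplified stand-in for the Mesos ExecutorInfo message."""
    executor_id: str
    command: str


def strictly_equal(existing: ExecutorInfo, incoming: ExecutorInfo) -> bool:
    # Check as described in this report: any differing field,
    # including the start command, makes the two infos incompatible.
    return existing == incoming


def same_executor_id(existing: ExecutorInfo, incoming: ExecutorInfo) -> bool:
    # Check proposed in this report: only the executor ID has to match.
    return existing.executor_id == incoming.executor_id


# Commands abbreviated from the log; only --core-node differs.
existing = ExecutorInfo(
    executor_id="conductr-node-10.0.0.249-executor",
    command="conductr-agent --core-node 10.0.0.246:9004",
)
incoming = ExecutorInfo(
    executor_id="conductr-node-10.0.0.249-executor",
    command="conductr-agent --core-node 10.0.0.248:9004",
)
```

With these two values, {{strictly_equal(existing, incoming)}} is false because the commands differ, while {{same_executor_id(existing, incoming)}} is true, which is the outcome the proposal asks for.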
was:
When a framework reconnects to an existing executor, Mesos checks whether
the new start command of the {{ExecutorInfo}} equals the old start command.
In the case of the ConductR framework, these start commands can differ because
the ConductR agent argument {{--core-node}} has a different value (in the log
below, {{10.0.0.246:9004}} in the existing {{ExecutorInfo}} versus
{{10.0.0.248:9004}} in the task's {{ExecutorInfo}}).
As a result, the Mesos master sends a {{TASK_ERROR}} for each running task to
the framework. The reason for the error is {{REASON_TASK_INVALID}}.
{{code:bash}}
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
akkaTimestamp=11:34:48.713UTC,
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR
received by the scheduler: task_id {
value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
}
state: TASK_ERROR
message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same
ExecutorID is not
compatible).\n------------------------------------------------------------\nExisting
ExecutorInfo:\nexecutor_id {\n value:
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n
name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role:
\"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value:
1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n
ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n
begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n
uris {\n value:
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
executable: false\n extract: true\n cache: false\n }\n uris {\n
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
executable: false\n extract: true\n cache: false\n }\n value:
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) &&
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
-Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\nTask\'s
ExecutorInfo:\nexecutor_id {\n value:
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n
type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n
name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role:
\"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value:
1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n
ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n
begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n
uris {\n value:
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
executable: false\n extract: true\n cache: false\n }\n uris {\n
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
executable: false\n extract: true\n cache: false\n }\n value:
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) &&
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552
-Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value:
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\n"
slave_id {
value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
}
timestamp: 1.474889688506464E9
source: SOURCE_MASTER
reason: REASON_TASK_INVALID
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR
MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22,
akkaTimestamp=11:34:48.714UTC,
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR
received by the scheduler: task_id {
value: "40034b01-e853-4ada-882f-9aaab67f77c2"
}
{{code}}
Mesos should validate only the executor ID. If the ID of the new
{{ExecutorInfo}} object equals the old one, Mesos should allow the
reconnection to the running executor.
> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
> Key: MESOS-6252
> URL: https://issues.apache.org/jira/browse/MESOS-6252
> Project: Mesos
> Issue Type: Bug
> Components: general
> Affects Versions: 0.28.1
> Environment: coreos
> Reporter: Markus Jura
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)