[ https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph Wu updated MESOS-6252:
-----------------------------
Description:

When a framework reconnects to an existing executor, Mesos checks whether the start command in the new {{ExecutorInfo}} equals the old start command.

With the ConductR framework, these start commands can differ because the ConductR agent argument {{--core-node}} can take a different value.

As a result, the Mesos master sends a {{TASK_ERROR}} to the framework for each running task. The reason given for the error is {{REASON_TASK_INVALID}}.

{code}
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.713UTC, akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client, sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by the scheduler: task_id { value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380" } state: TASK_ERROR message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible).\n------------------------------------------------------------\nExisting ExecutorInfo:\nexecutor_id {\n value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role: \"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value: 1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n uris {\n value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n executable: false\n extract: true\n cache: false\n }\n uris {\n value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n executable: false\n extract: true\n cache: false\n }\n
value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000 -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: \"conductr\"\n\n------------------------------------------------------------\nTask\'s ExecutorInfo:\nexecutor_id {\n value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n name: \"cpus\"\n type: SCALAR\n scalar {\n value: 0.9\n }\n role: \"*\"\n}\nresources {\n name: \"mem\"\n type: SCALAR\n scalar {\n value: 402.653184\n }\n role: \"*\"\n}\nresources {\n name: \"disk\"\n type: SCALAR\n scalar {\n value: 1000\n }\n role: \"*\"\n}\nresources {\n name: \"ports\"\n type: RANGES\n ranges {\n range {\n begin: 2552\n end: 2552\n }\n range {\n begin: 10000\n end: 10999\n }\n }\n role: \"*\"\n}\ncommand {\n uris {\n value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n executable: false\n extract: true\n cache: false\n }\n uris {\n value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n executable: false\n extract: true\n cache: false\n }\n value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000 -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: \"conductr\"\n\n------------------------------------------------------------\n" slave_id { value: "1154b639-c536-41d1-b9df-a57b24792acb-S4" } timestamp: 1.474889688506464E9 
source: SOURCE_MASTER reason: REASON_TASK_INVALID
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.714UTC, akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client, sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by the scheduler: task_id { value: "40034b01-e853-4ada-882f-9aaab67f77c2" }
{code}

Mesos should only validate the executor id. If the new id of the {{ExecutorInfo}} object equals the old one then it should allow the reconnection to the running executor.

> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
> Key: MESOS-6252
> URL: https://issues.apache.org/jira/browse/MESOS-6252
> Project: Mesos
> Issue Type: Bug
> Components: general
> Affects Versions: 0.28.1
> Environment: coreos
> Reporter: Markus Jura
>
--
This message was sent by Atlassian JIRA (v6.3.4#6332)