Markus Jura created MESOS-6252:
----------------------------------

             Summary: Do not validate start command when re-establishing 
connection to executor
                 Key: MESOS-6252
                 URL: https://issues.apache.org/jira/browse/MESOS-6252
             Project: Mesos
          Issue Type: Bug
          Components: general
    Affects Versions: 0.28.1
         Environment: coreos
            Reporter: Markus Jura


When a framework reconnects to an existing executor, Mesos checks whether the
start command in the new {{ExecutorInfo}} equals the old start command.

In the case of the ConductR framework, these start commands can differ because
the ConductR agent argument {{--core-node}} has a different value.
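
The difference is easy to miss in the long log message in this report, so here is a minimal sketch that tokenizes two start commands and lists the positions where they differ. The helper name `differing_tokens` is hypothetical, and the two commands are shortened versions of the ones in the log; this is only an illustration, not part of Mesos or ConductR.

```python
import shlex
from itertools import zip_longest


def differing_tokens(old_command, new_command):
    """Return (old, new) pairs for every position where the tokens differ."""
    old_tokens = shlex.split(old_command)
    new_tokens = shlex.split(new_command)
    # zip_longest pads the shorter command with None, so added or removed
    # arguments also show up as differences.
    return [
        (old, new)
        for old, new in zip_longest(old_tokens, new_tokens)
        if old != new
    ]


# Shortened versions of the two start commands from the log in this report.
existing = (
    "./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf "
    "--core-node 10.0.0.246:9004 --core-system-name stop-all-bundles-1"
)
proposed = (
    "./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf "
    "--core-node 10.0.0.248:9004 --core-system-name stop-all-bundles-1"
)

print(differing_tokens(existing, proposed))
```

With these two inputs, the only differing token is the value passed to {{--core-node}}.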

As a result, the Mesos master sends a {{TASK_ERROR}} to the framework for each
running task. The reason given for the error is {{REASON_TASK_INVALID}}.

{{code:bash}}
2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
MesosSchedulerClient 
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
akkaTimestamp=11:34:48.713UTC, 
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
 sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR 
received by the scheduler: task_id {
  value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
}
state: TASK_ERROR
message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same 
ExecutorID is not 
compatible).\n------------------------------------------------------------\nExisting
 ExecutorInfo:\nexecutor_id {\n  value: 
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  
name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: 
\"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n    value: 
1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: RANGES\n  
ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n    range {\n 
     begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand {\n  
uris {\n    value: 
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n    
executable: false\n    extract: true\n    cache: false\n  }\n  value: 
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) && 
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
-Dconductr-agent.run.allocated-ports.start=10000 
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
\"conductr\"\n\n------------------------------------------------------------\nTask\'s
 ExecutorInfo:\nexecutor_id {\n  value: 
\"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  
name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: 
\"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n    value: 
1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: RANGES\n  
ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n    range {\n 
     begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand {\n  
uris {\n    value: 
\"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n    
executable: false\n    extract: true\n    cache: false\n  }\n  value: 
\"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) && 
./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
-Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
-Dconductr-agent.run.allocated-ports.start=10000 
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 
--core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
\"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
\"conductr\"\n\n------------------------------------------------------------\n"
slave_id {
  value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
}
timestamp: 1.474889688506464E9
source: SOURCE_MASTER
reason: REASON_TASK_INVALID

2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
MesosSchedulerClient 
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
akkaTimestamp=11:34:48.714UTC, 
akkaSource=akka.tcp://[email protected]:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
 sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR 
received by the scheduler: task_id {
  value: "40034b01-e853-4ada-882f-9aaab67f77c2"
}
{{code}}

Mesos should validate only the executor ID. If the executor ID in the new
{{ExecutorInfo}} equals the old one, Mesos should allow the reconnection to
the running executor.
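
To make the requested change concrete, the following is a rough sketch that contrasts a strict whole-object comparison with the ID-only comparison proposed here. It is a simplified model, not Mesos code: the `ExecutorInfo` dataclass keeps only three of the fields visible in the log, and the function names `strict_compatible` and `id_only_compatible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    """Simplified stand-in for the Mesos ExecutorInfo message (assumption)."""
    executor_id: str
    framework_id: str
    command: str


def strict_compatible(existing, proposed):
    # Models the behavior described in this report: any differing field,
    # including the start command, makes the two objects incompatible.
    return existing == proposed


def id_only_compatible(existing, proposed):
    # Models the proposal: only the executor ID has to match.
    return existing.executor_id == proposed.executor_id


existing = ExecutorInfo(
    executor_id="conductr-node-10.0.0.249-executor",
    framework_id="stop-all-bundles-1",
    command="conductr-agent --core-node 10.0.0.246:9004",
)
proposed = ExecutorInfo(
    executor_id="conductr-node-10.0.0.249-executor",
    framework_id="stop-all-bundles-1",
    command="conductr-agent --core-node 10.0.0.248:9004",
)

print(strict_compatible(existing, proposed))   # False
print(id_only_compatible(existing, proposed))  # True
```

Under the strict check the two objects are rejected because of the {{--core-node}} value alone, while the ID-only check would accept the reconnection.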



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
