[ 
https://issues.apache.org/jira/browse/MESOS-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David vonThenen updated MESOS-6054:
-----------------------------------
    Attachment: _usr_sbin_mesos-slave.0.crash

A core dump of the mesos-slave process crash

> Agent Crash with Malformed UUID when doing TaskUpdate
> -----------------------------------------------------
>
>                 Key: MESOS-6054
>                 URL: https://issues.apache.org/jira/browse/MESOS-6054
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework api
>    Affects Versions: 1.0.0
>         Environment: Ubuntu 14.04, Mesos 1.0.0-2.0.89.ubuntu1404, Marathon 
> 1.1.2
>            Reporter: David vonThenen
>            Priority: Minor
>         Attachments: _usr_sbin_mesos-slave.0.crash
>
>
> When using the HTTP API with protobufs, a malformed UUID in a TaskUpdate (in 
> this case, a UUID that was base64 encoded) causes the Agent on which the 
> executor is running to crash and restart.
> Here is a JSON dump of the protobuf used:
> {code}
> {
>   "executor_id": {
>     "value": "executor-scaleio1"
>   },
>   "framework_id": {
>     "value": "ac8545a7-f8fc-431e-bc36-0239c4460658-0002"
>   },
>   "type": 2,
>   "update": {
>     "status": {
>       "task_id": {
>         "value": "scaleio1"
>       },
>       "state": 1,
>       "source": 2,
>       "executor_id": {
>         "value": "executor-scaleio1"
>       },
>       "uuid": "WVdVd01EQTFNakF0TkdVeU9TMDBNell3TFdJMk4yUXRPR05sT1RFNU56VmlPREUw"
>     }
>   }
> }
> {code}
> In the master log, it looks like it processes the ACCEPT calls, but after 
> processing all of them, the agents are immediately disconnected:
> {code}
> ...
> ...
> I0816 17:53:09.974340  4010 master.cpp:3342] Processing ACCEPT call for 
> offers: [ 2bf179c3-004a-49e3-98ab-5a75fa773522-O80 ] on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) for framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework)
> W0816 17:53:09.974578  4010 validation.cpp:647] Executor executor-scaleio4 
> for task scaleio4 uses less CPUs (None) than the minimum required (0.01). 
> Please update your executor, as this will be mandatory in future releases.
> W0816 17:53:09.974604  4010 validation.cpp:659] Executor executor-scaleio4 
> for task scaleio4 uses less memory (None) than the minimum required (32MB). 
> Please update your executor, as this will be mandatory in future releases.
> I0816 17:53:09.974645  4010 master.cpp:7439] Adding task scaleio4 with 
> resources cpus(*):1; mem(*):2048 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:09.974668  4010 master.cpp:3831] Launching task scaleio4 of 
> framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) with 
> resources cpus(*):1; mem(*):2048 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306182  4010 master.cpp:1245] Agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) disconnected
> I0816 17:53:11.306335  4010 master.cpp:2784] Disconnecting agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306520  4010 master.cpp:2803] Deactivating agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306676  4010 master.cpp:1264] Removing framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) from 
> disconnected agent 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at 
> slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) because the framework is 
> not checkpointing
> I0816 17:53:11.306798  4010 master.cpp:6448] Removing framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) from agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306882  4010 master.cpp:6833] Updating the state of task 
> scaleio4 of framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (latest 
> state: TASK_LOST, status update state: TASK_LOST)
> I0816 17:53:11.306778  4013 hierarchical.cpp:571] Agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 deactivated
> I0816 17:53:11.307140  4010 master.cpp:6899] Removing task scaleio4 with 
> resources cpus(*):1; mem(*):2048 of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.307312  4010 master.cpp:5190] Sending status update TASK_LOST 
> for task scaleio4 of framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 
> 'Slave ec2-52-89-227-184.us-west-2.compute.amazonaws.com disconnected'
> I0816 17:53:11.307533  4010 master.cpp:6928] Removing executor 
> 'executor-scaleio4' with resources  of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.472939  4017 master.cpp:1245] Agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S4 at slave(1)@172.31.17.252:5051 
> (ec2-52-88-195-213.us-west-2.compute.amazonaws.com) disconnected
> ...
> ...
> {code}
> The agent receives the POST from the executor:
> {code}
> ...
> ...
> I0816 17:51:09.001888  1237 slave.cpp:4591] Current disk usage 31.86%. Max 
> allowed age: 4.069593432939398days
> I0816 17:52:09.002300  1236 slave.cpp:4591] Current disk usage 31.86%. Max 
> allowed age: 4.069545128332523days
> I0816 17:53:09.002799  1234 slave.cpp:4591] Current disk usage 31.86%. Max 
> allowed age: 4.069496823725636days
> I0816 17:53:10.033020  1240 slave.cpp:1495] Got assigned task scaleio3 for 
> framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001
> I0816 17:53:10.033210  1240 slave.cpp:1614] Launching task scaleio3 for 
> framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001
> I0816 17:53:10.033980  1240 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/2bf179c3-004a-49e3-98ab-5a75fa773522-S5/frameworks/2bf179c3-004a-49e3-98ab-5a75fa773522-0001/executors/executor-scaleio3/runs/9aa4ee18-350b-4a65-a36b-eef9449f5d11'
>  to user 'root'
> I0816 17:53:10.036744  1240 slave.cpp:5674] Launching executor 
> executor-scaleio3 of framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 with 
> resources  in work directory 
> '/tmp/mesos/slaves/2bf179c3-004a-49e3-98ab-5a75fa773522-S5/frameworks/2bf179c3-004a-49e3-98ab-5a75fa773522-0001/executors/executor-scaleio3/runs/9aa4ee18-350b-4a65-a36b-eef9449f5d11'
> I0816 17:53:10.036864  1240 slave.cpp:1840] Queuing task 'scaleio3' for 
> executor 'executor-scaleio3' of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001
> I0816 17:53:10.036898  1237 containerizer.cpp:781] Starting container 
> '9aa4ee18-350b-4a65-a36b-eef9449f5d11' for executor 'executor-scaleio3' of 
> framework '2bf179c3-004a-49e3-98ab-5a75fa773522-0001'
> I0816 17:53:10.037387  1240 linux_launcher.cpp:281] Cloning child process 
> with flags = 
> I0816 17:53:10.457927  1234 http.cpp:270] HTTP POST for 
> /slave(1)/api/v1/executor from 172.31.23.107:49326 with 
> User-Agent='scaleio/0.1'
> I0816 17:53:10.458055  1234 slave.cpp:2661] Received Subscribe request for 
> HTTP executor 'executor-scaleio3' of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001
> I0816 17:53:10.462604  1234 slave.cpp:2005] Sending queued task 'scaleio3' to 
> executor 'executor-scaleio3' of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (via HTTP)
> I0816 17:53:11.464956  1233 http.cpp:270] HTTP POST for 
> /slave(1)/api/v1/executor from 172.31.23.107:49328 with 
> User-Agent='scaleio/0.1'
> {code}
> The agent then crashes and restarts with a new agent log:
> {code}
> Log file created at: 2016/08/16 17:53:11
> Running on machine: ec2-52-38-65-6.us-west-2.compute.amazonaws.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0816 17:53:11.674993  4977 logging.cpp:194] INFO level logging started!
> I0816 17:53:11.678026  4977 containerizer.cpp:196] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> I0816 17:53:11.681545  4977 linux_launcher.cpp:101] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0816 17:53:11.682831  4977 main.cpp:434] Starting Mesos agent
> ...
> ...
> {code}
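
For anyone reproducing this, the "uuid" value in the JSON dump above can be checked offline. The sketch below is not taken from the Mesos source. It assumes the dump base64-encodes the bytes field (as protobuf's JSON mapping does) and that the agent expects the raw 16-byte form of the UUID. The helper names are hypothetical.

```python
import base64
import uuid

# "uuid" value copied from the JSON dump in the report.
JSON_UUID_FIELD = (
    "WVdVd01EQTFNakF0TkdVeU9TMDBNell3TFdJMk4yUXRPR05sT1RFNU56VmlPREUw"
)

UUID_SIZE = 16  # a UUID is 128 bits


def bytes_from_json_field(value: str) -> bytes:
    """Undo the base64 layer assumed to be added by the JSON dump (hypothetical helper)."""
    return base64.b64decode(value, validate=True)


def is_raw_uuid(payload: bytes) -> bool:
    """True if the payload has the size of a raw binary UUID (hypothetical helper)."""
    return len(payload) == UUID_SIZE


# Bytes that were actually placed in the protobuf field.
sent = bytes_from_json_field(JSON_UUID_FIELD)
print(len(sent), is_raw_uuid(sent))

# Those bytes are themselves base64; decoding once more gives the textual UUID.
text = base64.b64decode(sent, validate=True).decode("ascii")
print(text)

# What a well-formed field would carry: the 16 raw bytes of that UUID.
raw = uuid.UUID(text).bytes
print(len(raw), is_raw_uuid(raw))

# How those 16 bytes would appear in a JSON dump like the one above.
print(base64.b64encode(raw).decode("ascii"))
```

With the value from the report, the field carries 48 bytes rather than 16: the textual UUID was base64 encoded by the executor before being stored in the bytes field. Running a size check like `is_raw_uuid` on the executor side before sending the update would catch this without involving the agent.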



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)