[ 
https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333751#comment-16333751
 ] 

Zhitao Li commented on MESOS-8161:
----------------------------------

The `TASK_ERROR` state was picked by framework author without good 
understanding of the problem.

I talked to the framework owner. The error was generated because they had code 
in executor which may access persistent volume after a task completed. My 
advice to them was to mount the persistent volume onto the executor instead so 
unmount only happens after every task as well as executor itself have 
terminated, thus no code should be accessing persistent volume anymore.

There are enough fixes from framework side and we cannot generate reliable 
reproduce anymore. I'll close the ticket for now. If we observe similar 
behaviors, I'll report again.

 

Thanks for your time, [~jieyu] and [~gilbert].

> Potentially dangerous dangling mount when stopping task with persistent volume
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-8161
>                 URL: https://issues.apache.org/jira/browse/MESOS-8161
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Zhitao Li
>            Priority: Major
>
> While we fixed a case in MESOS-7366 when an executor terminates, it seems 
> like a very similar case can still happen if a task with a persistent volume 
> terminates, executor still active, and [this unmount 
> call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
>  fails due to "device busy".
> I believe if agent gc or something other things run on the host mount 
> namespace, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for 
> task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
> executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888
> f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK 
> for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) 
> for task node-1__23fa9624-4608-404f-8d6f-0
> 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update 
> TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6
> 1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f o
> f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing 
> UPDATE for status update TASK_RUNNING (UUID: 
> c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6
> f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update 
> TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61
> f6d4-fd0f-48be-927d-14282c12301f-0014 to [email protected]:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for 
> task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
> executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max 
> allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888
> f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK 
> for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) 
> for task node-1__23fa9624-4608-404f-8d6f-0
> 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update 
> TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f
> 6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state 
> TASK_ERROR
> I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount 
> '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_ex
> ecutor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume'
>  for persistent volume disk(cassandra-test-varung5-framework, cassandra, 
> {resource_id: 411a63af-0fea-4
> d2d-b850-a77039756e99})[a24515db-5b6b-4538-b1da-5c1acf0fe286:volume]:2000000 
> of container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.137557 102241 disk.cpp:207] Updating the disk resources for 
> container da3ecaac-35dc-4464-8204-76577dcde2a8 to 
> cpus(cassandra-test-varung5-framework, cassandra, {resource_id: 58e9c94
> c-7512-45d2-bf7e-453ec42b55bf}):0.1; mem(cassandra-test-varung5-framework, 
> cassandra, {resource_id: 57c52b95-5816-43cd-b8c4-8a46d4271788}):768; 
> ports(cassandra-test-varung5-framework, cassandra, {
> resource_id: 6ed4bec0-b7bf-4513-8c65-197f6e3e1a44}):[31001-31001]
> I1101 20:20:14.137765 102241 disk.cpp:312] Checking disk usage at 
> '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/no
> de-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume'
>  for container da3ecaac-35dc-4464-8204-76577dcde2a8 has been cancelled
> I1101 20:20:14.137763 102227 memory.cpp:199] Updated 
> 'memory.soft_limit_in_bytes' to 768MB for container 
> da3ecaac-35dc-4464-8204-76577dcde2a8
> E1101 20:20:14.138339 102212 slave.cpp:3903] Failed to update resources for 
> container da3ecaac-35dc-4464-8204-76577dcde2a8 of executor 
> 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' running task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f on status update for terminal 
> task, destroying container: Collect failed: Failed to unmount unneeded 
> persistent volume at 
> '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume':
>  Failed to unmount 
> '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume':
>  Device or resource busy
> I1101 20:20:14.138587 102220 status_update_manager.cpp:323] Received status 
> update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138608 102233 containerizer.cpp:1955] Destroying container 
> da3ecaac-35dc-4464-8204-76577dcde2a8 in RUNNING state
> I1101 20:20:14.138664 102220 status_update_manager.cpp:832] Checkpointing 
> UPDATE for status update TASK_ERROR (UUID: 
> 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138756 102232 linux_launcher.cpp:498] Asked to destroy 
> container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.138847 102214 slave.cpp:4051] Forwarding the update TASK_ERROR 
> (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014 to [email protected]:5050
> I1101 20:20:14.138978 102214 slave.cpp:3961] Sending acknowledgement for 
> status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for 
> task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:14.139230 102232 linux_launcher.cpp:541] Using freezer to destroy 
> cgroup mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.139691 102227 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 
> 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.140110 102235 cgroups.cpp:2705] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.141026 102227 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
> da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.243934 102236 cgroups.cpp:1439] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 
> 103.768064ms
> I1101 20:20:14.246510 102236 cgroups.cpp:2723] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.249119 102218 cgroups.cpp:1468] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 
> 2.564096ms
> I1101 20:20:14.945940 102213 slave.cpp:4179] Got exited event for 
> executor(1)@10.70.142.140:36929
> I1101 20:20:18.942217 102236 http.cpp:277] HTTP GET for /slave(1)/state from 
> 10.70.142.140:52050 with User-Agent='python-requests/2.4.3 CPython/2.7.10 
> Linux/4.4.82'
> I1101 20:20:20.412132 102235 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:20.412247 102235 status_update_manager.cpp:832] Checkpointing ACK 
> for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for 
> task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014
> W1101 20:20:20.665359 102213 slave.cpp:2669] Ignoring updating pid for 
> framework db61f6d4-fd0f-48be-927d-14282c12301f-0005 because it does not exist
> I1101 20:20:28.572569 102237 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:48290 with User-Agent='Go-http-client/1.1'
> W1101 20:20:28.573552 102215 containerizer.cpp:1876] Skipping status for 
> container da3ecaac-35dc-4464-8204-76577dcde2a8 because: Container does not 
> exist
> I1101 20:20:28.619093 102233 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:48292 with User-Agent='Go-http-client/1.1'
> I1101 20:20:37.789496 102218 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:58214 with User-Agent='filebundle-agent'
> I1101 20:21:01.325791 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:55670 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.333392 102225 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:55691 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.335985 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:55704 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.428691 102241 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:55980 with User-Agent='Python-urllib/2.7'
> I1101 20:21:02.247719 102214 slave.cpp:5044] Current disk usage 0.23%. Max 
> allowed age: 6.283563174420139days
> I1101 20:21:07.789693 102233 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:44932 with User-Agent='filebundle-agent'
> I1101 20:21:09.948168 102236 http.cpp:277] HTTP GET for /slave(1)/state from 
> 10.70.142.140:51804 with User-Agent='python-requests/2.4.3 CPython/2.7.10 
> Linux/4.4.82'
> E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 
> 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all 
> processes in the container: Timed out after 1mins
> I1101 20:21:14.140986 102231 slave.cpp:4646] Cleaning up executor 
> 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework 
> db61f6d4-fd0f-48be-927d-14282c12301f-0014 at executor(1)@10.70.142.140:36929
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to