[ 
https://issues.apache.org/jira/browse/MESOS-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738609#comment-16738609
 ] 

Chun-Hung Hsiao commented on MESOS-9517:
----------------------------------------

It seems to me that we should address this with MESOS-8400.

> SLRP should treat gRPC timeouts as non-terminal errors, instead of reporting 
> OPERATION_FAILED.
> ----------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9517
>                 URL: https://issues.apache.org/jira/browse/MESOS-9517
>             Project: Mesos
>          Issue Type: Bug
>          Components: resource provider, storage
>            Reporter: James DeFelice
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>              Labels: mesosphere
>
> 1. The framework executes a CREATE_DISK operation.
> 2. The SLRP issues a CreateVolume RPC to the plugin.
> 3. The RPC call times out.
> 4. The agent/SLRP translates non-terminal gRPC timeout errors 
> (DeadlineExceeded) for "CreateVolume" calls into OPERATION_FAILED, which is 
> terminal.
> 5. The framework receives a *terminal* OPERATION_FAILED status, so it executes 
> another CREATE_DISK operation.
> 6. The second CREATE_DISK operation does not time out.
> 7. The first CREATE_DISK operation was actually completed by the plugin, 
> unbeknownst to the SLRP.
> 8. There's now an orphan volume in the storage system that no one is tracking.
>
> Proposed solution: the SLRP should make more intelligent decisions about 
> non-terminal gRPC errors. For example, timeouts are to be expected for 
> potentially long-running storage operations and should not be considered 
> terminal. In such cases, the SLRP should NOT report OPERATION_FAILED and 
> should instead re-issue the *same* (idempotent) CreateVolume call to the 
> plugin to ascertain the status of the requested volume creation.
>
> Agent logs for the three orphan volumes:
> {code}
> [jdefelice@ec101 DCOS-46889]$ grep -e 3bd1a1a9-43d3-485c-9275-59cebd64b07c 
> agent.log
> Jan 09 11:10:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:10:27.896306 13189 provider.cpp:1548] Received CREATE_DISK operation 
> 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c)
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> E0109 11:11:27.904057 13190 provider.cpp:1605] Failed to apply operation 
> (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c): Deadline Exceeded
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.904058 13192 status_update_manager_process.hpp:152] Received 
> operation status update OPERATION_FAILED (Status UUID: 
> 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
> 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.904331 13192 status_update_manager_process.hpp:929] 
> Checkpointing UPDATE for operation status update OPERATION_FAILED (Status 
> UUID: 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
> 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.947286 13189 slave.cpp:7696] Handling resource provider 
> message 'UPDATE_OPERATION_STATUS: (uuid: 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: OPERATION_FAILED, 
> status update state: OPERATION_FAILED)'
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.947376 13189 slave.cpp:8034] Updating the state of operation 
> 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for 
> framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: 
> OPERATION_FAILED, status update state: OPERATION_FAILED)
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.947407 13189 slave.cpp:7890] Forwarding status update of 
> operation 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (operation_uuid: 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.952689 13193 status_update_manager_process.hpp:252] Received 
> operation status update acknowledgement (UUID: 
> 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for stream 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.952725 13193 status_update_manager_process.hpp:929] 
> Checkpointing ACK for operation status update OPERATION_FAILED (Status UUID: 
> 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
> 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
> 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> [jdefelice@ec101 DCOS-46889]$ grep -e 4acf1495-1a36-4939-a71b-75ca5aa73657 
> agent.log
> Jan 09 11:10:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:10:28.452811 13192 provider.cpp:1548] Received CREATE_DISK operation 
> 'a5MU6JqxYpT9IWXM75cwuHO-0' (uuid: 4acf1495-1a36-4939-a71b-75ca5aa73657)
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> E0109 11:11:28.460510 13190 provider.cpp:1605] Failed to apply operation 
> (uuid: 4acf1495-1a36-4939-a71b-75ca5aa73657): Deadline Exceeded
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.460511 13186 status_update_manager_process.hpp:152] Received 
> operation status update OPERATION_FAILED (Status UUID: 
> e810608b-58ac-47eb-bf19-9abcca6907a2) for operation UUID 
> 4acf1495-1a36-4939-a71b-75ca5aa73657 (framework-supplied ID 
> 'a5MU6JqxYpT9IWXM75cwuHO-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.460793 13186 status_update_manager_process.hpp:929] 
> Checkpointing UPDATE for operation status update OPERATION_FAILED (Status 
> UUID: e810608b-58ac-47eb-bf19-9abcca6907a2) for operation UUID 
> 4acf1495-1a36-4939-a71b-75ca5aa73657 (framework-supplied ID 
> 'a5MU6JqxYpT9IWXM75cwuHO-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.504062 13191 slave.cpp:7696] Handling resource provider 
> message 'UPDATE_OPERATION_STATUS: (uuid: 
> 4acf1495-1a36-4939-a71b-75ca5aa73657) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: OPERATION_FAILED, 
> status update state: OPERATION_FAILED)'
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.504133 13191 slave.cpp:8034] Updating the state of operation 
> 'a5MU6JqxYpT9IWXM75cwuHO-0' (uuid: 4acf1495-1a36-4939-a71b-75ca5aa73657) for 
> framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: 
> OPERATION_FAILED, status update state: OPERATION_FAILED)
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.504159 13191 slave.cpp:7890] Forwarding status update of 
> operation 'a5MU6JqxYpT9IWXM75cwuHO-0' (operation_uuid: 
> 4acf1495-1a36-4939-a71b-75ca5aa73657) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.509495 13194 status_update_manager_process.hpp:252] Received 
> operation status update acknowledgement (UUID: 
> e810608b-58ac-47eb-bf19-9abcca6907a2) for stream 
> 4acf1495-1a36-4939-a71b-75ca5aa73657
> Jan 09 11:11:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:28.509521 13194 status_update_manager_process.hpp:929] 
> Checkpointing ACK for operation status update OPERATION_FAILED (Status UUID: 
> e810608b-58ac-47eb-bf19-9abcca6907a2) for operation UUID 
> 4acf1495-1a36-4939-a71b-75ca5aa73657 (framework-supplied ID 
> 'a5MU6JqxYpT9IWXM75cwuHO-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> [jdefelice@ec101 DCOS-46889]$ grep -e ca2bed2f-480e-4d35-af9e-1161a44c5b9b 
> agent.log
> Jan 09 11:10:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:10:27.458933 13186 provider.cpp:1548] Received CREATE_DISK operation 
> 'a3AvAF97UsHU6zIIPhyGdrY-0' (uuid: ca2bed2f-480e-4d35-af9e-1161a44c5b9b)
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> E0109 11:11:27.469853 13189 provider.cpp:1605] Failed to apply operation 
> (uuid: ca2bed2f-480e-4d35-af9e-1161a44c5b9b): Deadline Exceeded
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.469859 13186 status_update_manager_process.hpp:152] Received 
> operation status update OPERATION_FAILED (Status UUID: 
> bb7807e8-dc2f-4f64-b611-d24a1e559317) for operation UUID 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b (framework-supplied ID 
> 'a3AvAF97UsHU6zIIPhyGdrY-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.470120 13186 status_update_manager_process.hpp:929] 
> Checkpointing UPDATE for operation status update OPERATION_FAILED (Status 
> UUID: bb7807e8-dc2f-4f64-b611-d24a1e559317) for operation UUID 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b (framework-supplied ID 
> 'a3AvAF97UsHU6zIIPhyGdrY-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.513059 13192 slave.cpp:7696] Handling resource provider 
> message 'UPDATE_OPERATION_STATUS: (uuid: 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: OPERATION_FAILED, 
> status update state: OPERATION_FAILED)'
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.513129 13192 slave.cpp:8034] Updating the state of operation 
> 'a3AvAF97UsHU6zIIPhyGdrY-0' (uuid: ca2bed2f-480e-4d35-af9e-1161a44c5b9b) for 
> framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: 
> OPERATION_FAILED, status update state: OPERATION_FAILED)
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.513147 13192 slave.cpp:7890] Forwarding status update of 
> operation 'a3AvAF97UsHU6zIIPhyGdrY-0' (operation_uuid: 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b) for framework 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-0002
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.518623 13191 status_update_manager_process.hpp:252] Received 
> operation status update acknowledgement (UUID: 
> bb7807e8-dc2f-4f64-b611-d24a1e559317) for stream 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b
> Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
> I0109 11:11:27.518656 13191 status_update_manager_process.hpp:929] 
> Checkpointing ACK for operation status update OPERATION_FAILED (Status UUID: 
> bb7807e8-dc2f-4f64-b611-d24a1e559317) for operation UUID 
> ca2bed2f-480e-4d35-af9e-1161a44c5b9b (framework-supplied ID 
> 'a3AvAF97UsHU6zIIPhyGdrY-0') of framework 
> 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
> c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
> {code}
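
The retry behavior proposed in the quoted description can be sketched roughly
as follows. This is only an illustration under stated assumptions, not the SLRP
implementation: `Code`, `FakePlugin`, `apply_create_disk`, the set of retryable
codes, the attempt limit, and the "PENDING" placeholder state are all
hypothetical names chosen for this example. The fake plugin creates the volume
even when the reply is lost to a timeout, as in step 7 of the description.

```python
import enum
from dataclasses import dataclass, field


class Code(enum.Enum):
    """Stand-ins for a few gRPC status codes (not the grpc library's enum)."""
    OK = "OK"
    DEADLINE_EXCEEDED = "DEADLINE_EXCEEDED"
    UNAVAILABLE = "UNAVAILABLE"
    INVALID_ARGUMENT = "INVALID_ARGUMENT"


# Assumption: which codes count as non-terminal is a policy choice made
# for this sketch; the issue itself only names DeadlineExceeded.
RETRYABLE = {Code.DEADLINE_EXCEEDED, Code.UNAVAILABLE}


@dataclass
class FakePlugin:
    """Hypothetical idempotent plugin: one volume per requested name."""
    timeouts_before_reply: int = 1
    volumes: dict = field(default_factory=dict)

    def create_volume(self, name):
        # The volume is created even if the reply never reaches the caller.
        # Repeating the call with the same name does not create a second one.
        self.volumes.setdefault(name, f"vol-{len(self.volumes) + 1}")
        if self.timeouts_before_reply > 0:
            self.timeouts_before_reply -= 1
            return Code.DEADLINE_EXCEEDED, None
        return Code.OK, self.volumes[name]


def apply_create_disk(plugin, name, max_attempts=5):
    """Re-issue the same CreateVolume call while the error is non-terminal."""
    for _ in range(max_attempts):
        code, volume_id = plugin.create_volume(name)
        if code is Code.OK:
            return "OPERATION_FINISHED", volume_id
        if code not in RETRYABLE:
            # Only errors that are actually terminal fail the operation.
            return "OPERATION_FAILED", None
    # Outcome still unknown: report no terminal status ("PENDING" is a
    # placeholder name for this sketch).
    return "PENDING", None


if __name__ == "__main__":
    plugin = FakePlugin(timeouts_before_reply=2)
    print(apply_create_disk(plugin, "disk-a"), len(plugin.volumes))
```

Because the retried call carries the same volume name, the plugin ends up
tracking a single volume and the caller learns its ID, instead of leaving an
orphan behind and creating a second volume.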


