[
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320357#comment-16320357
]
Benjamin Bannier commented on MESOS-8422:
-----------------------------------------
The issue here seems to be that resources of the resource provider sent as part
of the {{UpdateSlaveMessage}} after failover already had the operation applied;
applying the same operation again when an offer operation status update is
received cannot work.
We should update the master to transition newly terminal offer operations when
it learns about that from {{UpdateSlaveMessage}}. Since the master will only
update resource state in the offer operation status update handler when the
operation is _newly_ terminal, this would prevent this setup from becoming a
problem.
> Master's UpdateSlave handler not correctly updating terminated operations
> -------------------------------------------------------------------------
>
> Key: MESOS-8422
> URL: https://issues.apache.org/jira/browse/MESOS-8422
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Benjamin Bannier
> Labels: mesosphere
>
> I created a test that verifies that operation status updates are resent to
> the master after being dropped en route to it (MESOS-8420).
> The test does the following:
> # Creates a volume from a RAW disk resource.
> # Drops the first `UpdateOperationStatusMessage` message from the agent to
> the master, so that it isn't acknowledged by the master.
> # Restarts the agent.
> # Verifies that the agent resends the operation status update.
> The good news are that the agent is resending the operation status update,
> the bad news are that it triggers a CHECK failure that crashes the master.
> Here are the relevant sections of the log produced by the test:
> {noformat}
> [ RUN ]
> StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
> [...]
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for
> offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681
> (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default)
> at [email protected]:40681
> I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME
> operation with source disk(allocated: storage)(reservations:
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at
> [email protected]:40681 to agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
> I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME
> operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest
> '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
> I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from
> 'disk(allocated: storage)(reservations:
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated:
> storage)(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received
> operation status update OPERATION_FINISHED (Status UUID:
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929]
> Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status
> UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for
> /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
> I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider
> message 'UPDATE_OPERATION_STATUS: (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)'
> I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation '
> with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of
> operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
> I0109 16:36:08.584115 24144 master.cpp:1305] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681
> (core-dev) disconnected
> [...]
> I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master
> [email protected]:40681
> I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming
> sending task status updates
> I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> I0109 16:36:08.656008 24140 slave.cpp:1423] Forwarding agent update
> {"operations":{},"resource_version_uuid":{"value":"icuAKyO6TymMt2Y9vyF6Jg=="},"slave_id":{"value":"046b3f21-6e97-4a56-9a13-773f7d481efd-S0"},"update_oversubscribed_resources":true}
> I0109 16:36:08.656121 24149 hierarchical.cpp:754] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 reactivated
> W0109 16:36:08.656481 24113 master.cpp:7277] !!!! update slave message:
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: true
> operations {
> }
> resource_version_uuid {
> value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> I0109 16:36:08.656637 24113 master.cpp:7320] Received update of agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(3)@10.0.49.2:40681
> (core-dev) with total oversubscribed resources {}
> W0109 16:36:08.657387 24113 master.cpp:7704] Performing explicit
> reconciliation with agent for known operation
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 since it was not present in original
> reconciliation message from agent
> I0109 16:36:08.657917 24133 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> W0109 16:36:08.658048 24125 manager.cpp:472] Dropping operation
> reconciliation message with operation_uuid
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 because resource provider
> 605b22f5-e39d-4d9f-950a-e7f44d202c01 is not subscribed
> I0109 16:36:08.658609 24143 container_daemon.cpp:119] Launching container
> 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-slrp_test--CONTROLLER_SERVICE-NODE_SERVICE'
> [...]
> I0109 16:36:08.689859 24130 provider.cpp:3066] Sending UPDATE_STATE call with
> resources 'disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> and 1 operations to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.690449 24130 provider.cpp:1042] Resource provider
> 605b22f5-e39d-4d9f-950a-e7f44d202c01 is in READY state
> I0109 16:36:08.690491 24105 status_update_manager_process.hpp:385] Resuming
> operation status update manager
> I0109 16:36:08.690640 24105 status_update_manager_process.hpp:394] Sending
> operation status update OPERATION_FINISHED (Status UUID:
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.693244 24131 http.cpp:1185] HTTP POST for
> /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693912 24140 http.cpp:1185] HTTP POST for
> /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693974 24115 manager.cpp:677] Received UPDATE_STATE call with
> resources
> '[{"disk":{"source":{"id":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","metadata":{"labels":[{"key":"path","value":"\/tmp\/n5thZ3\/test\/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"}]},"mount":{"root":".\/csi\/org.apache.mesos.csi.test\/slrp_test\/mounts\/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"},"profile":"volume-default","type":"MOUNT"}},"name":"disk","provider_id":{"value":"605b22f5-e39d-4d9f-950a-e7f44d202c01"},"reservations":[{"role":"storage","type":"DYNAMIC"}],"scalar":{"value":4096.0},"type":"SCALAR"}]'
> and 1 operations from resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01
> I0109 16:36:08.694897 24144 slave.cpp:7065] Handling resource provider
> message 'UPDATE_STATE: 605b22f5-e39d-4d9f-950a-e7f44d202c01
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> I0109 16:36:08.695184 24144 slave.cpp:7182] Forwarding new total resources
> cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096
> I0109 16:36:08.696467 24144 slave.cpp:7065] Handling resource provider
> message 'UPDATE_OPERATION_STATUS: (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)'
> I0109 16:36:08.696594 24144 slave.cpp:7409] Updating the state of operation '
> with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.696666 24144 slave.cpp:7249] Forwarding status update of
> operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> W0109 16:36:08.697093 24142 master.cpp:7277] !!!! update slave message:
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: false
> operations {
> }
> resource_version_uuid {
> value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> resource_providers {
> providers {
> info {
> id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> type: "org.apache.mesos.rp.local.storage"
> name: "test"
> default_reservations {
> role: "storage"
> type: DYNAMIC
> }
> storage {
> plugin {
> type: "org.apache.mesos.csi.test"
> name: "slrp_test"
> containers {
> [...]
> }
> }
> }
> }
> total_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> operations {
> operations {
> framework_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-0000"
> }
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> info {
> type: CREATE_VOLUME
> create_volume {
> source {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: RAW
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> target_type: MOUNT
> }
> }
> latest_status {
> state: OPERATION_FINISHED
> converted_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> uuid {
> value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
> }
> }
> statuses {
> state: OPERATION_FINISHED
> converted_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> uuid {
> value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
> }
> }
> uuid {
> value: "\030\264\304\245\321bM\317\273!\241<n\340\364\010"
> }
> }
> }
> resource_version_uuid {
> value: "M\250\313j\320\301IG\262\0164e\004\367\304\333"
> }
> }
> }
> I0109 16:36:08.700137 24142 master.cpp:10411] Updating the state of operation
> '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) of framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.700417 24146 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F0109 16:36:08.700610 24142 master.cpp:11687] CHECK_SOME(resources):
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain
> disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
> F0109 16:36:08.700896 24146 hierarchical.cpp:908] CHECK_SOME(updatedTotal):
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain
> disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
> @ 0x7ff06d3bbe7e (unknown)
> @ 0x7ff06d3bbe7e (unknown)
> @ 0x7ff06d3bbddd (unknown)
> @ 0x7ff06d3bbddd (unknown)
> @ 0x7ff06d3bb7ee (unknown)
> @ 0x7ff06d3bb7ee (unknown)
> @ 0x7ff06d3be522 (unknown)
> @ 0x55c1c6c2be77
> _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
> @ 0x7ff06d3be522 (unknown)
> @ 0x55c1c6c2be77
> _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
> @ 0x7ff06b729277 (unknown)
> @ 0x55c1c6f3be8a
> _ZTSN6lambda12CallableOnceIFvRK6ResultIN5mesos2v117resource_provider5EventEEEE10CallableFnINS_8internal7PartialIZNK7process6FutureIS6_E7onReadyISt5_BindIFSt7_Mem_fnIMSG_FbS8_EESG_St12_PlaceholderILi1EEEEbEERKSG_OT_NSG_6PreferEEUlOSQ_S8_E_ISQ_SO_EEEEE
> {noformat}
> We can see that once the SLRP reregisters with the agent, the following
> happens:
> # The agent will send an {{UpdateSlave}} message to the master including the
> converted resources and the {{CREATE_VOLUME}} operation with the status
> {{OPERATION_FINISHED}}.
> # The master will update the agent's resources, including the volume created
> by the operation.
> # The agent will resend the operation status update.
> # The master will try to apply the operation and crash, because it already
> updated the agent's resources on step #2.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)