[
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324623#comment-16324623
]
Benjamin Mahler commented on MESOS-8422:
----------------------------------------
Should this be updated to a 'Blocker' or is it ok being re-targeted to 1.5.1?
> Master's UpdateSlave handler not correctly updating terminated operations
> -------------------------------------------------------------------------
>
> Key: MESOS-8422
> URL: https://issues.apache.org/jira/browse/MESOS-8422
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0
> Reporter: Gastón Kleiman
> Assignee: Benjamin Bannier
> Labels: mesosphere
>
> I created a test that verifies that operation status updates are resent to
> the master after being dropped en route to it (MESOS-8420).
> The test does the following:
> # Creates a volume from a RAW disk resource.
> # Drops the first `UpdateOperationStatusMessage` message from the agent to
> the master, so that it isn't acknowledged by the master.
> # Restarts the agent.
> # Verifies that the agent resends the operation status update.
> The good news are that the agent is resending the operation status update,
> the bad news are that it triggers a CHECK failure that crashes the master.
> Here are the relevant sections of the log produced by the test:
> {noformat}
> [ RUN ]
> StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
> [...]
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for
> offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681
> (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default)
> at [email protected]:40681
> I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME
> operation with source disk(allocated: storage)(reservations:
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at
> [email protected]:40681 to agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
> I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME
> operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest
> '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
> I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from
> 'disk(allocated: storage)(reservations:
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated:
> storage)(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received
> operation status update OPERATION_FINISHED (Status UUID:
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929]
> Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status
> UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for
> /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
> I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider
> message 'UPDATE_OPERATION_STATUS: (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)'
> I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation '
> with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of
> operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
> I0109 16:36:08.584115 24144 master.cpp:1305] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681
> (core-dev) disconnected
> [...]
> I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master
> [email protected]:40681
> I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming
> sending task status updates
> I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> I0109 16:36:08.656008 24140 slave.cpp:1423] Forwarding agent update
> {"operations":{},"resource_version_uuid":{"value":"icuAKyO6TymMt2Y9vyF6Jg=="},"slave_id":{"value":"046b3f21-6e97-4a56-9a13-773f7d481efd-S0"},"update_oversubscribed_resources":true}
> I0109 16:36:08.656121 24149 hierarchical.cpp:754] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 reactivated
> W0109 16:36:08.656481 24113 master.cpp:7277] !!!! update slave message:
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: true
> operations {
> }
> resource_version_uuid {
> value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> I0109 16:36:08.656637 24113 master.cpp:7320] Received update of agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(3)@10.0.49.2:40681
> (core-dev) with total oversubscribed resources {}
> W0109 16:36:08.657387 24113 master.cpp:7704] Performing explicit
> reconciliation with agent for known operation
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 since it was not present in original
> reconciliation message from agent
> I0109 16:36:08.657917 24133 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> W0109 16:36:08.658048 24125 manager.cpp:472] Dropping operation
> reconciliation message with operation_uuid
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 because resource provider
> 605b22f5-e39d-4d9f-950a-e7f44d202c01 is not subscribed
> I0109 16:36:08.658609 24143 container_daemon.cpp:119] Launching container
> 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-slrp_test--CONTROLLER_SERVICE-NODE_SERVICE'
> [...]
> I0109 16:36:08.689859 24130 provider.cpp:3066] Sending UPDATE_STATE call with
> resources 'disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> and 1 operations to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.690449 24130 provider.cpp:1042] Resource provider
> 605b22f5-e39d-4d9f-950a-e7f44d202c01 is in READY state
> I0109 16:36:08.690491 24105 status_update_manager_process.hpp:385] Resuming
> operation status update manager
> I0109 16:36:08.690640 24105 status_update_manager_process.hpp:394] Sending
> operation status update OPERATION_FINISHED (Status UUID:
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework
> '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.693244 24131 http.cpp:1185] HTTP POST for
> /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693912 24140 http.cpp:1185] HTTP POST for
> /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
> I0109 16:36:08.693974 24115 manager.cpp:677] Received UPDATE_STATE call with
> resources
> '[{"disk":{"source":{"id":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","metadata":{"labels":[{"key":"path","value":"\/tmp\/n5thZ3\/test\/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"}]},"mount":{"root":".\/csi\/org.apache.mesos.csi.test\/slrp_test\/mounts\/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"},"profile":"volume-default","type":"MOUNT"}},"name":"disk","provider_id":{"value":"605b22f5-e39d-4d9f-950a-e7f44d202c01"},"reservations":[{"role":"storage","type":"DYNAMIC"}],"scalar":{"value":4096.0},"type":"SCALAR"}]'
> and 1 operations from resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01
> I0109 16:36:08.694897 24144 slave.cpp:7065] Handling resource provider
> message 'UPDATE_STATE: 605b22f5-e39d-4d9f-950a-e7f44d202c01
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
> I0109 16:36:08.695184 24144 slave.cpp:7182] Forwarding new total resources
> cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096
> I0109 16:36:08.696467 24144 slave.cpp:7065] Handling resource provider
> message 'UPDATE_OPERATION_STATUS: (uuid:
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)'
> I0109 16:36:08.696594 24144 slave.cpp:7409] Updating the state of operation '
> with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.696666 24144 slave.cpp:7249] Forwarding status update of
> operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
> W0109 16:36:08.697093 24142 master.cpp:7277] !!!! update slave message:
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> update_oversubscribed_resources: false
> operations {
> }
> resource_version_uuid {
> value: "\211\313\200+#\272O)\214\267f=\277!z&"
> }
> resource_providers {
> providers {
> info {
> id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> type: "org.apache.mesos.rp.local.storage"
> name: "test"
> default_reservations {
> role: "storage"
> type: DYNAMIC
> }
> storage {
> plugin {
> type: "org.apache.mesos.csi.test"
> name: "slrp_test"
> containers {
> [...]
> }
> }
> }
> }
> total_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> operations {
> operations {
> framework_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-0000"
> }
> slave_id {
> value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
> }
> info {
> type: CREATE_VOLUME
> create_volume {
> source {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: RAW
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> target_type: MOUNT
> }
> }
> latest_status {
> state: OPERATION_FINISHED
> converted_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> uuid {
> value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
> }
> }
> statuses {
> state: OPERATION_FINISHED
> converted_resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 4096
> }
> disk {
> source {
> type: MOUNT
> mount {
> root:
> "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> metadata {
> labels {
> key: "path"
> value:
> "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
> }
> }
> profile: "volume-default"
> }
> }
> allocation_info {
> role: "storage"
> }
> provider_id {
> value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
> }
> reservations {
> role: "storage"
> type: DYNAMIC
> }
> }
> uuid {
> value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
> }
> }
> uuid {
> value: "\030\264\304\245\321bM\317\273!\241<n\340\364\010"
> }
> }
> }
> resource_version_uuid {
> value: "M\250\313j\320\301IG\262\0164e\004\367\304\333"
> }
> }
> }
> I0109 16:36:08.700137 24142 master.cpp:10411] Updating the state of operation
> '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) of framework
> 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED,
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.700417 24146 hierarchical.cpp:669] Agent
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total
> resources disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F0109 16:36:08.700610 24142 master.cpp:11687] CHECK_SOME(resources):
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain
> disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
> F0109 16:36:08.700896 24146 hierarchical.cpp:908] CHECK_SOME(updatedTotal):
> disk(reservations:
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096;
> cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain
> disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
> *** Check failure stack trace: ***
> @ 0x7ff06d3bbe7e (unknown)
> @ 0x7ff06d3bbe7e (unknown)
> @ 0x7ff06d3bbddd (unknown)
> @ 0x7ff06d3bbddd (unknown)
> @ 0x7ff06d3bb7ee (unknown)
> @ 0x7ff06d3bb7ee (unknown)
> @ 0x7ff06d3be522 (unknown)
> @ 0x55c1c6c2be77
> _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
> @ 0x7ff06d3be522 (unknown)
> @ 0x55c1c6c2be77
> _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
> @ 0x7ff06b729277 (unknown)
> @ 0x55c1c6f3be8a
> _ZTSN6lambda12CallableOnceIFvRK6ResultIN5mesos2v117resource_provider5EventEEEE10CallableFnINS_8internal7PartialIZNK7process6FutureIS6_E7onReadyISt5_BindIFSt7_Mem_fnIMSG_FbS8_EESG_St12_PlaceholderILi1EEEEbEERKSG_OT_NSG_6PreferEEUlOSQ_S8_E_ISQ_SO_EEEEE
> {noformat}
> We can see that once the SLRP reregisters with the agent, the following
> happens:
> # The agent will send an {{UpdateSlave}} message to the master including the
> converted resources and the {{CREATE_VOLUME}} operation with the status
> {{OPERATION_FINISHED}}.
> # The master will update the agent's resources, including the volume created
> by the operation.
> # The agent will resend the operation status update.
> # The master will try to apply the operation and crash, because it already
> updated the agent's resources on step #2.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)