Siyao Meng created HDDS-9146:
--------------------------------
Summary: Potential data loss with HSync
Key: HDDS-9146
URL: https://issues.apache.org/jira/browse/HDDS-9146
Project: Apache Ozone
Issue Type: Bug
Reporter: Siyao Meng
When {{hsync()}} is called on a key stream and then followed by {{close()}},
two {{OMKeyCommitRequest}}s are triggered: the first with {{isHSync = true}}
and the second with {{isHSync = false}}. After this sequence, {{deletedTable}}
can contain an entry whose block has exactly the same containerId and locId as
the block of the committed key in {{keyTable}}. This can cause OM's
{{KeyDeletingService}} to remove the committed block by mistake, resulting in
data loss once the container is closed, after which block deletion actually
happens on the DNs.
Repro integration test branch (based on [~erose]'s integration test, which is
in turn based on my initial draft):
{code:title=Test log. Note that the entries in keyTable and deletedTable share the same block
conID: 1 and locID: 111677748019200001}
2023-08-09 14:31:54,859 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(159)) - keyTable: -----
START -----
2023-08-09 14:31:54,860 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(168)) - keyTable: key =
/testozonevol/testozonebucket/inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001,
val = OmKeyInfo{volumeName='testozonevol', bucketName='testozonebucket',
keyName='inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001',
dataSize=11, keyLocationVersions=[OmKeyLocationInfoGroup{version=0,
locationVersionMap={0=[{blockID={conID: 1 locID: 111677748019200001 bcsId: 2},
length=11, offset=0, token=null, pipeline=null, createVersion=0,
partNumber=0}]}, isMultipartKey=false}], creationTime=1691616714661,
modificationTime=1691616714848, replicationConfig=RATIS/THREE, encInfo=null,
fileChecksum=null, isFile=true, fileName='part-m-00001'}
2023-08-09 14:31:54,860 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(171)) - keyTable: -----
END -----
2023-08-09 14:31:54,860 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(173)) - deletedTable: -----
START -----
2023-08-09 14:31:54,861 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(181)) - deletedTable: key =
/testozonevol/testozonebucket/inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001/-9223372036854774528,
val = RepeatedOmKeyInfo{omKeyInfoList=[OmKeyInfo{volumeName='testozonevol',
bucketName='testozonebucket',
keyName='inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001',
dataSize=11, keyLocationVersions=[OmKeyLocationInfoGroup{version=0,
locationVersionMap={0=[{blockID={conID: 1 locID: 111677748019200001 bcsId: 0},
length=11, offset=0, token=null, pipeline=null, createVersion=0,
partNumber=0}]}, isMultipartKey=false}], creationTime=1691616714661,
modificationTime=1691616714834, replicationConfig=RATIS/THREE, encInfo=null,
fileChecksum=null, isFile=true, fileName='part-m-00001'}]}
2023-08-09 14:31:54,861 [main] WARN ozone.TestMiniOzoneCluster
(TestMiniOzoneCluster.java:testKeyRenameDirDelete(184)) - deletedTable: -----
END -----
{code}
It seems to me that the fix should be to filter out any block that shares the
same containerId and locId with the committed key when adding an entry to
deletedTable inside OMKeyCommitRequest / OMKeyCommitRequestWithFSO. But I'm no
expert in HSync, so please advise. cc [~weichiu] [~szetszwo]
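
To make the proposed filtering concrete, here is a minimal, self-contained sketch. It is not the actual Ozone code: {{BlockRef}}, {{filterOutCommitted}} and the class name are hypothetical stand-ins invented for illustration, and the real change would have to work on the location lists inside OmKeyInfo. The one assumption worth calling out is that blocks are matched on containerId and locId only, ignoring bcsId, because in the log above the two entries differ in bcsId (2 vs 0) while referring to the same block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeletedBlockFilterSketch {

    /** Hypothetical stand-in for a block reference; NOT Ozone's BlockID class. */
    public static final class BlockRef {
        public final long containerId;
        public final long localId;
        public final long bcsId;

        public BlockRef(long containerId, long localId, long bcsId) {
            this.containerId = containerId;
            this.localId = localId;
            this.bcsId = bcsId;
        }

        /** Identity used for matching: bcsId is deliberately ignored. */
        public String identity() {
            return containerId + ":" + localId;
        }
    }

    /**
     * Returns the blocks from toDelete that do not share (containerId, localId)
     * with any block of the key being committed. Only these would be safe to
     * add to deletedTable.
     */
    public static List<BlockRef> filterOutCommitted(List<BlockRef> toDelete,
                                                    List<BlockRef> committed) {
        Set<String> live = new HashSet<>();
        for (BlockRef b : committed) {
            live.add(b.identity());
        }
        List<BlockRef> safe = new ArrayList<>();
        for (BlockRef b : toDelete) {
            if (!live.contains(b.identity())) {
                safe.add(b);
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        // Block IDs taken from the test log: same conID/locID, different bcsId.
        List<BlockRef> committed =
            Arrays.asList(new BlockRef(1L, 111677748019200001L, 2L));
        // The second candidate (locID 42) is a made-up, unrelated block.
        List<BlockRef> toDelete = Arrays.asList(
            new BlockRef(1L, 111677748019200001L, 0L),
            new BlockRef(1L, 42L, 0L));

        for (BlockRef b : filterOutCommitted(toDelete, committed)) {
            System.out.println("safe to delete: conID=" + b.containerId
                + " locID=" + b.localId);
        }
    }
}
```

With the values from the log, the block that is still referenced by the committed key is dropped from the deletion candidates, while the unrelated block is kept. Whether filtering at this point is the right fix, or whether the hsync commit path should avoid producing the entry in the first place, is exactly the open question above.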