[
https://issues.apache.org/jira/browse/HDDS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved HDDS-9876.
------------------------------
Fix Version/s: 1.5.0
Resolution: Fixed
The pull request is now merged. Thanks, [~Sammi]!
> OzoneManagerStateMachine should add response to OzoneManagerDoubleBuffer for
> every write request
> ------------------------------------------------------------------------------------------------
>
> Key: HDDS-9876
> URL: https://issues.apache.org/jira/browse/HDDS-9876
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.5.0
>
>
> This task is to resolve the issues in HDDS-9342.
> HDDS-2680 introduced logic in OzoneManagerStateMachine that calculates the
> lastAppliedTermIndex from two maps, applyTransactionMap and
> ratisTransactionMap. Every write request from RATIS that goes through
> applyTransaction adds its trxLogIndex to applyTransactionMap. Every write
> request flushed by OzoneManagerDoubleBuffer#flushBatch has its trxLogIndex
> removed from applyTransactionMap when flushBatch calls
> ozoneManagerRatisSnapShot.updateLastAppliedIndex(flushedEpochs).
> If a write request from RATIS does not go through
> OzoneManagerDoubleBuffer#flushBatch, its trxLogIndex is left in
> applyTransactionMap forever. Since lastAppliedIndex can only be updated
> incrementally, any trxLogIndex not confirmed by an OzoneManagerDoubleBuffer
> flush stops lastAppliedIndex from growing past it. Write requests after the
> unconfirmed one can still be flushed, but their trxLogIndex values are
> added to ratisTransactionMap, which makes ratisTransactionMap keep growing.
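For illustration, the stall described above can be modeled in a few lines of Java. This is a simplified sketch, not the Ozone code: the class AppliedIndexModel and its methods are hypothetical, plain sets stand in for the real maps, and only the names applyTransactionMap, ratisTransactionMap and lastAppliedIndex are borrowed from the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of the bookkeeping; not the real OzoneManagerStateMachine.
class AppliedIndexModel {
  // Indexes seen by applyTransaction and not yet confirmed by a flush.
  private final TreeSet<Long> applyTransactionMap = new TreeSet<>();
  // Flushed indexes that cannot be counted yet because an earlier one is missing.
  private final TreeSet<Long> ratisTransactionMap = new TreeSet<>();
  private long lastAppliedIndex = 0;

  void applyTransaction(long trxLogIndex) {
    applyTransactionMap.add(trxLogIndex);
  }

  // Simplification: park every flushed index, then drain the contiguous prefix.
  void flush(List<Long> flushedEpochs) {
    for (long index : flushedEpochs) {
      applyTransactionMap.remove(index);
      ratisTransactionMap.add(index);
    }
    // lastAppliedIndex only advances one step at a time, so it stops at a gap.
    while (ratisTransactionMap.remove(lastAppliedIndex + 1)) {
      lastAppliedIndex++;
    }
  }

  long lastAppliedIndex() { return lastAppliedIndex; }
  TreeSet<Long> unconfirmed() { return applyTransactionMap; }
  TreeSet<Long> parked() { return ratisTransactionMap; }

  public static void main(String[] args) {
    AppliedIndexModel model = new AppliedIndexModel();
    for (long i = 1; i <= 5; i++) {
      model.applyTransaction(i);
    }
    // Index 3 failed before reaching the double buffer, so it is never flushed.
    model.flush(Arrays.asList(1L, 2L, 4L, 5L));
    System.out.println("lastAppliedIndex=" + model.lastAppliedIndex());
    System.out.println("applyTransactionMap=" + model.unconfirmed());
    System.out.println("ratisTransactionMap=" + model.parked());
  }
}
```

In this model lastAppliedIndex stays at 2, index 3 remains in applyTransactionMap, and indexes 4 and 5 wait in ratisTransactionMap. Every later flush adds more entries there.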
> How can a write request go unconfirmed by an OzoneManagerDoubleBuffer flush?
> Here is one case reproduced locally.
> T1: create bucket1
> T2: client1 sends a delete bucket "bucket1" request to OM. OM verifies that
> bucket1 exists, then sends the request to RATIS to handle it.
> T3: client2 sends a create key "bucket1/key1" request to OM. OM verifies
> that bucket1 exists, then sends the request to RATIS.
> T4: OzoneManagerStateMachine executes delete bucket "bucket1" successfully
> and returns the response to client1.
> T5: OzoneManagerStateMachine executes the create key "bucket1/key1" request;
> "bucket1" cannot be found, so execution fails and a failure is returned to
> client2.
> In T5, the failure stack trace is:
> {code:java}
> 2023-10-18 19:04:10,131 [OM StateMachine ApplyTransaction Thread - 0] WARN org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Failed to write, Exception occurred
> BUCKET_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Bucket not found: s3v/prod-voyager
>     at org.apache.hadoop.ozone.om.OzoneManagerUtils.reportNotFound(OzoneManagerUtils.java:87)
>     at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketInfo(OzoneManagerUtils.java:72)
>     at org.apache.hadoop.ozone.om.OzoneManagerUtils.resolveBucketInfoLink(OzoneManagerUtils.java:148)
>     at org.apache.hadoop.ozone.om.OzoneManagerUtils.getResolvedBucketInfo(OzoneManagerUtils.java:124)
>     at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketLayout(OzoneManagerUtils.java:106)
>     at org.apache.hadoop.ozone.om.request.BucketLayoutAwareOMKeyRequestFactory.createRequest(BucketLayoutAwareOMKeyRequestFactory.java:230)
>     at org.apache.hadoop.ozone.om.ratis.utils.OzoneManagerRatisUtils.createClientRequest(OzoneManagerRatisUtils.java:336)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:380)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> In OzoneManagerStateMachine.runCommand, when an IOException is thrown from
> OzoneManagerRequestHandler.handleWriteRequest, runCommand constructs and
> returns an OMResponse to the client but does not add the response to
> OzoneManagerDoubleBuffer, so OzoneManagerDoubleBuffer is not aware of this
> request or its trxLogIndex. As a consequence, this trxLogIndex stays in
> applyTransactionMap forever.
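The direction named in the issue title, adding a response to the double buffer for every write request, might look roughly like the sketch below. It is not the merged patch. Response, DoubleBuffer, WriteHandler and this runCommand signature are hypothetical stand-ins for OMResponse, OzoneManagerDoubleBuffer and the real handler. The sketch also assumes, for simplicity, that runCommand itself registers the response on both paths.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are the real Ozone classes.
class RunCommandSketch {
  static final class Response {
    final long trxLogIndex;
    final boolean success;
    final String message;

    Response(long trxLogIndex, boolean success, String message) {
      this.trxLogIndex = trxLogIndex;
      this.success = success;
      this.message = message;
    }
  }

  // Stand-in for the double buffer: remembers every index it will later flush.
  static final class DoubleBuffer {
    final List<Long> pending = new ArrayList<>();

    void add(Response response) {
      pending.add(response.trxLogIndex);
    }
  }

  interface WriteHandler {
    Response handle(long trxLogIndex) throws IOException;
  }

  static Response runCommand(long trxLogIndex, WriteHandler handler, DoubleBuffer buffer) {
    Response response;
    try {
      response = handler.handle(trxLogIndex);
    } catch (IOException e) {
      // Build the error response as before ...
      response = new Response(trxLogIndex, false, e.getMessage());
    }
    // ... but register it on both paths, so the index is confirmed by a flush.
    buffer.add(response);
    return response;
  }

  public static void main(String[] args) {
    DoubleBuffer buffer = new DoubleBuffer();
    runCommand(1L, index -> new Response(index, true, "OK"), buffer);
    runCommand(2L, index -> { throw new IOException("BUCKET_NOT_FOUND"); }, buffer);
    System.out.println("pending=" + buffer.pending);
  }
}
```

With the failed request's index in the buffer, a later flush can report it, and applyTransactionMap no longer keeps it indefinitely.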