[ 
https://issues.apache.org/jira/browse/HDDS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDDS-9876.
------------------------------
    Fix Version/s: 1.5.0
       Resolution: Fixed

The pull request is now merged.  Thanks, [~Sammi]!

> OzoneManagerStateMachine should add response to OzoneManagerDoubleBuffer for 
> every write request
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-9876
>                 URL: https://issues.apache.org/jira/browse/HDDS-9876
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.0
>
>
> This task is to resolve the issues in HDDS-9342.
> HDDS-2680 introduced logic in OzoneManagerStateMachine that calculates
> lastAppliedTermIndex from two maps, applyTransactionMap and
> ratisTransactionMap. Every write request that arrives from Ratis through
> applyTransaction adds its trxLogIndex to applyTransactionMap. Every write
> request that is flushed by OzoneManagerDoubleBuffer#flushBatch has its
> trxLogIndex removed from applyTransactionMap when flushBatch calls
> ozoneManagerRatisSnapShot.updateLastAppliedIndex(flushedEpochs).
> If a write request from Ratis does not go through
> OzoneManagerDoubleBuffer#flushBatch, its trxLogIndex stays in
> applyTransactionMap forever. Because lastAppliedIndex can only advance
> through consecutive indexes, a trxLogIndex that is never confirmed by an
> OzoneManagerDoubleBuffer flush stops lastAppliedIndex just before it. Write
> requests after the unconfirmed one can still be flushed, but their
> trxLogIndex values are added to ratisTransactionMap, which therefore keeps
> growing.
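
The bookkeeping described above can be modelled in a few lines. This is only
a simplified sketch under assumptions, not the Ozone implementation: the class
AppliedIndexTracker and its accessor methods are hypothetical, and the two
maps only borrow the names used in the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Simplified, hypothetical model of the two-map bookkeeping; not Ozone code.
class AppliedIndexTracker {
    // trxLogIndex -> term for requests applied but not yet confirmed by a flush
    private final TreeMap<Long, Long> applyTransactionMap = new TreeMap<>();
    // flushed indexes that cannot be folded into lastAppliedIndex yet
    private final TreeMap<Long, Long> ratisTransactionMap = new TreeMap<>();
    private long lastAppliedIndex;

    AppliedIndexTracker(long initialIndex) {
        this.lastAppliedIndex = initialIndex;
    }

    // Called for every write request applied by the state machine.
    void applyTransaction(long trxLogIndex, long term) {
        applyTransactionMap.put(trxLogIndex, term);
    }

    // Called with the indexes confirmed by one double-buffer flush.
    void updateLastAppliedIndex(List<Long> flushedEpochs) {
        for (long index : flushedEpochs) {
            Long term = applyTransactionMap.remove(index);
            if (term == null) {
                continue; // unknown index, nothing to confirm
            }
            if (index == lastAppliedIndex + 1) {
                lastAppliedIndex = index;             // consecutive: advance
            } else {
                ratisTransactionMap.put(index, term); // gap: park it
            }
        }
        // Fold in parked indexes that have become consecutive.
        while (ratisTransactionMap.remove(lastAppliedIndex + 1) != null) {
            lastAppliedIndex++;
        }
    }

    long getLastAppliedIndex() { return lastAppliedIndex; }
    int pendingCount() { return applyTransactionMap.size(); }
    int parkedCount() { return ratisTransactionMap.size(); }

    public static void main(String[] args) {
        AppliedIndexTracker tracker = new AppliedIndexTracker(0);
        for (long i = 1; i <= 5; i++) {
            tracker.applyTransaction(i, 1L);
        }
        // Index 2 is never flushed, so everything after it gets parked.
        tracker.updateLastAppliedIndex(Arrays.asList(1L));
        tracker.updateLastAppliedIndex(Arrays.asList(3L, 4L, 5L));
        System.out.println("lastAppliedIndex = " + tracker.getLastAppliedIndex());
        System.out.println("pending = " + tracker.pendingCount()
            + ", parked = " + tracker.parkedCount());
    }
}
```

In this model lastAppliedIndex stays at 1 while index 2 is unconfirmed, and
every later flush only adds entries to the parked map. Confirming index 2
afterwards lets the index advance to 5 and empties both maps.
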
> How can a write request end up not being confirmed by an
> OzoneManagerDoubleBuffer flush? Here is one case reproduced locally.
> T1: create bucket1
> T2: client1 sends a delete bucket "bucket1" request to OM. OM verifies that
> bucket1 exists, then sends the request to Ratis to be handled.
> T3: client2 sends a create key "bucket1/key1" request to OM. OM verifies that
> bucket1 exists, then sends the request to Ratis.
> T4: OzoneManagerStateMachine executes delete bucket "bucket1" successfully
> and returns the response to client1.
> T5: OzoneManagerStateMachine executes the create key "bucket1/key1" request.
> "bucket1" cannot be found, so execution fails and a failure is returned to
> client2.
> In T5, the failure stack is
> {code:java}
> 2023-10-18 19:04:10,131 [OM StateMachine ApplyTransaction Thread - 0] WARN org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Failed to write, Exception occurred
> BUCKET_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Bucket not found: s3v/prod-voyager
> at org.apache.hadoop.ozone.om.OzoneManagerUtils.reportNotFound(OzoneManagerUtils.java:87)
> at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketInfo(OzoneManagerUtils.java:72)
> at org.apache.hadoop.ozone.om.OzoneManagerUtils.resolveBucketInfoLink(OzoneManagerUtils.java:148)
> at org.apache.hadoop.ozone.om.OzoneManagerUtils.getResolvedBucketInfo(OzoneManagerUtils.java:124)
> at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketLayout(OzoneManagerUtils.java:106)
> at org.apache.hadoop.ozone.om.request.BucketLayoutAwareOMKeyRequestFactory.createRequest(BucketLayoutAwareOMKeyRequestFactory.java:230)
> at org.apache.hadoop.ozone.om.ratis.utils.OzoneManagerRatisUtils.createClientRequest(OzoneManagerRatisUtils.java:336)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:380)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
> In OzoneManagerStateMachine.runCommand, when an IOException is thrown from
> OzoneManagerRequestHandler.handleWriteRequest, runCommand constructs an
> OMResponse and returns it to the client but does not add the response to
> OzoneManagerDoubleBuffer, so OzoneManagerDoubleBuffer is not aware of the
> request or its trxLogIndex. As a consequence, this trxLogIndex stays in
> applyTransactionMap forever.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

