[ 
https://issues.apache.org/jira/browse/HDDS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787245#comment-17787245
 ] 

Hongbing Wang commented on HDDS-9342:
-------------------------------------

[~szetszwo]  Thanks for the code. Using  online env's RocksDB and RaftLog has 
become inconvenient for debugging due to already running in a non-HA 
single-node mode. However, I have reproduced it (or a similar scenario) in the 
test environment and obtained the stack trace, as below:
{code:java}
// stack which triggered the error as follows :
          at 
org.apache.hadoop.ozone.om.helpers.OmKeyInfo.<init>(OmKeyInfo.java:136)
          at 
org.apache.hadoop.ozone.om.helpers.OmKeyInfo$Builder.build(OmKeyInfo.java:589)
          at 
org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getFromProtobuf(OmKeyInfo.java:745)
          at 
org.apache.hadoop.ozone.om.helpers.OmKeyInfo$$Lambda$100.1141264726.apply(Unknown
 Source:-1)
          at 
org.apache.hadoop.hdds.utils.db.DelegatedCodec.fromCodecBuffer(DelegatedCodec.java:91)
          at 
org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:355)
          at 
org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:318)
          at org.apache.hadoop.hdds.utils.db.TypedTable.get(TypedTable.java:228)
          at 
org.apache.hadoop.ozone.om.request.file.OMFileRequest.getOmKeyInfoFromFileTable(OMFileRequest.java:610)
          at 
org.apache.hadoop.ozone.om.request.file.OMFileRequest.getOMKeyInfoIfExists(OMFileRequest.java:669)
          at 
org.apache.hadoop.ozone.om.KeyManagerImpl.getOzoneFileStatusFSO(KeyManagerImpl.java:1314)
          at 
org.apache.hadoop.ozone.om.KeyManagerImpl.getFileStatus(KeyManagerImpl.java:1143)
          at 
org.apache.hadoop.ozone.om.KeyManagerImpl.getFileStatus(KeyManagerImpl.java:1120)
          at 
org.apache.hadoop.ozone.om.OzonePrefixPathImpl.<init>(OzonePrefixPathImpl.java:62)
          at 
org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:242)
          at 
org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
          at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
          at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
          at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
          at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$464.1044406904.get(Unknown
 Source:-1)
          at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
          at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745) {code}
So, I speculate that when `getFromTable` from RocksDB, the updateId is already 
a larger value. Indeed, that's the case. I parsed the RocksDB `fileTable` and 
found the file corresponding to the updateId that triggered the error, as 
follows:
{code:java}
// err info as follows :
renameKeyRequest {
  keyArgs {
    volumeName: "vol-bigdata-offline"
    bucketName: "weisuonan"
    keyName: "checkative/english-book_Ozone"
    dataSize: 0
    modificationTime: 1697627113034
  }
  toKeyName: 
".Trash/weisuonan/Current/vol-bigdata-offline/weisuonan/checkative/english-book_Ozone1697627113026"
}
 failed with exception
java.lang.IllegalArgumentException: Trying to set updateID to 206114651 which 
is not greater than the current value of 405823931 for 
OMKeyInfo{volume='vol-bigdata-offline', bucket='weisuonan', 
key='checkative/english-book_Ozone', dataSize='23124', 
creationTime='1698118516546', objectID='-9223371932963850495', 
parentID='-9223372036604311807', replication='RATIS/THREE', fileChecksum='null}
        at 
org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
        at 
org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.renameKey(OMKeyRenameRequestWithFSO.java:286)
        at 
org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.validateAndUpdateCache(OMKeyRenameRequestWithFSO.java:219)
        at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
        at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
{code}
{noformat}
// object which updateId=405823931 parsed from rocksdb fileTable :
// fileTable
}, 
"/-9223372036604313856/-9223372036604313344/-9223372036604311807/english-book_Ozone":
 {
  "volumeName": "vol-bigdata-offline",
  "bucketName": "weisuonan",
  "keyName": "english-book_Ozone",
  "dataSize": 23124,
  ...
  "parentObjectID": -9223372036604311807,
  "objectID": -9223371932963850495,
  "updateID": 405823931,
  "metadata": {}
---
// directoryTable
}, 
"/-9223372036604313856/-9223372036604313344/-9223372036604313344/checkative": {
  "name": "checkative",
  "creationTime": 1697206505369,
  "modificationTime": 1698118517702,
  ...
  ],
  "parentObjectID": -9223372036604313344,
  "objectID": -9223372036604311807,
{noformat}
So it is unreasonable for keyName: "checkative/english-book_Ozone" to 
correspond to a larger updateId.  I suspect that through the 
`create->delete->create` operations, the updateId for the file with the same 
filename got updated to a larger value.

*(2) I reproduced it in an new cluster the following way:*
{code:java}
// one ratis mode
ozone sh bucket create s3v/buk1
ozone sh key put s3v/buk1/key1 /etc/hosts
ozone sh key delete s3v/buk1/key1
ozone sh key put s3v/buk1/key1 /etc/hosts
ozone sh key put s3v/buk1/changeTransactionIndex /etc/hosts {code}
changeTransactionIndex way:
{code:java}
// OzoneManagerDoubleBuffer
// set lastRatisTransactionIndex = 3 when ozone sh key put 
s3v/buk1/changeTransactionIndex /etc/hosts 
      addToBatchTransactionInfoWithTrace(lastTraceId,
          lastRatisTransactionIndex,
          () -> omMetadataManager.getTransactionInfoTable().putWithBatch(
              batchOperation, TRANSACTION_INFO_KEY,
              new TransactionInfo.Builder()
                  .setTransactionIndex(lastRatisTransactionIndex)
                  .setCurrentTerm(term)
                  .build()));      long startTime = Time.monotonicNow();
      flushBatchWithTrace(lastTraceId, buffer.size(),
          () -> omMetadataManager.getStore()
              .commitBatchOperation(batchOperation)); {code}
then restart om, reproduced it:
{code:java}
failed with exception
java.lang.IllegalArgumentException: Trying to set updateID to 11 which is not 
greater than the current value of 13 for OMKeyInfo{volume='s3v', bucket='buk1', 
key='key1', dataSize='703233', creationTime='1700214876060', 
objectID='-9223372036854772991', parentID='-9223372036854775552', 
replication='RATIS/THREE', fileChecksum='null}
        at 
org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
        at 
org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:721)
        at 
org.apache.hadoop.ozone.om.request.key.OMKeyCreateRequestWithFSO.validateAndUpdateCache(OMKeyCreateRequestWithFSO.java:163)
        at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
        at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
 {code}
*(3)* Could this be caused by `OzoneManagerStateMachine#takeSnapshot` updating 
TransactionInfoTable to smaller? We found that takeSnapshot was not fast enough 
to keep up with transactionId.
{noformat}
2023-10-18 17:16:03,028 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:4, i:204017881)
2023-10-18 17:37:14,031 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:4, i:204417882)
2023-10-18 17:50:11,412 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:4, i:204666693)
2023-10-18 18:09:02,208 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:5, i:205066695)
2023-10-18 18:31:53,119 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:5, i:205466696)
2023-10-18 18:53:19,399 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:5, i:205866697)
2023-10-24 11:40:08,558 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:5, i:206114616)
2023-10-24 11:40:13,779 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot 
Index (t:5, i:206114616) 
{noformat}
ratisLog:
{code:sh}
// ratis data dir:
ll | wc -l
11869
ll | head
total 194372576
-rw-rw-r-- 1 hadoop hadoop 16776283 Oct 24 12:12 log_206094558-206117379
-rw-rw-r-- 1 hadoop hadoop 16776515 Oct 24 12:12 log_206117380-206139612
-rw-rw-r-- 1 hadoop hadoop 16776192 Oct 24 12:12 log_206139613-206162121
-rw-rw-r-- 1 hadoop hadoop 16776319 Oct 24 12:12 log_206162122-206184686
-rw-rw-r-- 1 hadoop hadoop 16776531 Oct 24 12:12 log_206184687-206207228
-rw-rw-r-- 1 hadoop hadoop 16776976 Oct 24 12:12 log_206207229-206229818
-rw-rw-r-- 1 hadoop hadoop 16775986 Oct 24 12:12 log_206229819-206252483
-rw-rw-r-- 1 hadoop hadoop 16776311 Oct 24 12:12 log_206252484-206275029
-rw-rw-r-- 1 hadoop hadoop 16776682 Oct 24 12:12 log_206275030-206297665
ll | tail
-rw-rw-r-- 1 hadoop hadoop 16776408 Oct 24 12:14 log_405672200-405691320
-rw-rw-r-- 1 hadoop hadoop 16776198 Oct 24 12:14 log_405691321-405710197
-rw-rw-r-- 1 hadoop hadoop 16776874 Oct 24 12:14 log_405710198-405729231
-rw-rw-r-- 1 hadoop hadoop 16777122 Oct 24 12:14 log_405729232-405748582
-rw-rw-r-- 1 hadoop hadoop 16776624 Oct 24 12:14 log_405748583-405767797
-rw-rw-r-- 1 hadoop hadoop 16775784 Oct 24 12:14 log_405767798-405787358
-rw-rw-r-- 1 hadoop hadoop 16776092 Oct 24 12:14 log_405787359-405807241
-rw-rw-r-- 1 hadoop hadoop 14003799 Oct 24 12:12 log_inprogress_405807242
-rw-rw-r-- 1 hadoop hadoop       52 Oct 24 12:12 raft-meta
-rw-rw-r-- 1 hadoop hadoop      151 Oct 24 12:12 raft-meta.conf
{code}
 

> OM restart failed due to transactionLogIndex smaller than current updateID
> --------------------------------------------------------------------------
>
>                 Key: HDDS-9342
>                 URL: https://issues.apache.org/jira/browse/HDDS-9342
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, OM HA
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Priority: Critical
>         Attachments: om.shutdown-20230922.log
>
>
> OM restart failed, log as follow:
> create failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863625 
> which is not greater than the current value of 2901863627 for 
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='user/xxx/platform/xxx', 
> dataSize='268435456', creationTime='1695088210914', 
> objectID='-9223371293977687808', parentID='0', replication='RATIS/THREE', 
> fileChecksum='null}
>         at 
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:665)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareKeyInfo(OMKeyRequest.java:623)
>         at 
> org.apache.hadoop.ozone.om.request.file.OMFileCreateRequest.validateAndUpdateCache(OMFileCreateRequest.java:255)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> rename failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863669 
> which is not greater than the current value of 3076345041 for 
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='checkative/xxx', 
> dataSize='23124', creationTime='1695380440059', 
> objectID='-9223371249310446848', parentID='0', replication='RATIS/THREE', 
> fileChecksum='null}
>         at 
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:190)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to