[
https://issues.apache.org/jira/browse/HDDS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787245#comment-17787245
]
Hongbing Wang commented on HDDS-9342:
-------------------------------------
[~szetszwo] Thanks for the code. Using online env's RocksDB and RaftLog has
become inconvenient for debugging due to already running in a non-HA
single-node mode. However, I have reproduced it (or a similar scenario) in the
test environment and obtained the stack trace, as below:
{code:java}
// stack which triggered the error as follows :
at
org.apache.hadoop.ozone.om.helpers.OmKeyInfo.<init>(OmKeyInfo.java:136)
at
org.apache.hadoop.ozone.om.helpers.OmKeyInfo$Builder.build(OmKeyInfo.java:589)
at
org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getFromProtobuf(OmKeyInfo.java:745)
at
org.apache.hadoop.ozone.om.helpers.OmKeyInfo$$Lambda$100.1141264726.apply(Unknown
Source:-1)
at
org.apache.hadoop.hdds.utils.db.DelegatedCodec.fromCodecBuffer(DelegatedCodec.java:91)
at
org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:355)
at
org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(TypedTable.java:318)
at org.apache.hadoop.hdds.utils.db.TypedTable.get(TypedTable.java:228)
at
org.apache.hadoop.ozone.om.request.file.OMFileRequest.getOmKeyInfoFromFileTable(OMFileRequest.java:610)
at
org.apache.hadoop.ozone.om.request.file.OMFileRequest.getOMKeyInfoIfExists(OMFileRequest.java:669)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.getOzoneFileStatusFSO(KeyManagerImpl.java:1314)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.getFileStatus(KeyManagerImpl.java:1143)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.getFileStatus(KeyManagerImpl.java:1120)
at
org.apache.hadoop.ozone.om.OzonePrefixPathImpl.<init>(OzonePrefixPathImpl.java:62)
at
org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:242)
at
org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$464.1044406904.get(Unknown
Source:-1)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745) {code}
So, I speculate that when `getFromTable` from RocksDB, the updateId is already
a larger value. Indeed, that's the case. I parsed the RocksDB `fileTable` and
found the file corresponding to the updateId that triggered the error, as
follows:
{code:java}
// err info as follows :
renameKeyRequest {
keyArgs {
volumeName: "vol-bigdata-offline"
bucketName: "weisuonan"
keyName: "checkative/english-book_Ozone"
dataSize: 0
modificationTime: 1697627113034
}
toKeyName:
".Trash/weisuonan/Current/vol-bigdata-offline/weisuonan/checkative/english-book_Ozone1697627113026"
}
failed with exception
java.lang.IllegalArgumentException: Trying to set updateID to 206114651 which
is not greater than the current value of 405823931 for
OMKeyInfo{volume='vol-bigdata-offline', bucket='weisuonan',
key='checkative/english-book_Ozone', dataSize='23124',
creationTime='1698118516546', objectID='-9223371932963850495',
parentID='-9223372036604311807', replication='RATIS/THREE', fileChecksum='null}
at
org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
at
org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.renameKey(OMKeyRenameRequestWithFSO.java:286)
at
org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.validateAndUpdateCache(OMKeyRenameRequestWithFSO.java:219)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
{code}
{noformat}
// object which updateId=405823931 parsed from rocksdb fileTable :
// fileTable
},
"/-9223372036604313856/-9223372036604313344/-9223372036604311807/english-book_Ozone":
{
"volumeName": "vol-bigdata-offline",
"bucketName": "weisuonan",
"keyName": "english-book_Ozone",
"dataSize": 23124,
...
"parentObjectID": -9223372036604311807,
"objectID": -9223371932963850495,
"updateID": 405823931,
"metadata": {}
---
// directoryTable
},
"/-9223372036604313856/-9223372036604313344/-9223372036604313344/checkative": {
"name": "checkative",
"creationTime": 1697206505369,
"modificationTime": 1698118517702,
...
],
"parentObjectID": -9223372036604313344,
"objectID": -9223372036604311807,
{noformat}
So it is unreasonable for keyName: "checkative/english-book_Ozone" to
correspond to a larger updateId. I suspect that through the
`create->delete->create` operations, the updateId for the file with the same
filename got updated to a larger value.
*(2) I reproduced it in an new cluster the following way:*
{code:java}
// one ratis mode
ozone sh bucket create s3v/buk1
ozone sh key put s3v/buk1/key1 /etc/hosts
ozone sh key delete s3v/buk1/key1
ozone sh key put s3v/buk1/key1 /etc/hosts
ozone sh key put s3v/buk1/changeTransactionIndex /etc/hosts {code}
changeTransactionIndex way:
{code:java}
// OzoneManagerDoubleBuffer
// set lastRatisTransactionIndex = 3 when ozone sh key put
s3v/buk1/changeTransactionIndex /etc/hosts
addToBatchTransactionInfoWithTrace(lastTraceId,
lastRatisTransactionIndex,
() -> omMetadataManager.getTransactionInfoTable().putWithBatch(
batchOperation, TRANSACTION_INFO_KEY,
new TransactionInfo.Builder()
.setTransactionIndex(lastRatisTransactionIndex)
.setCurrentTerm(term)
.build())); long startTime = Time.monotonicNow();
flushBatchWithTrace(lastTraceId, buffer.size(),
() -> omMetadataManager.getStore()
.commitBatchOperation(batchOperation)); {code}
then restart om, reproduced it:
{code:java}
failed with exception
java.lang.IllegalArgumentException: Trying to set updateID to 11 which is not
greater than the current value of 13 for OMKeyInfo{volume='s3v', bucket='buk1',
key='key1', dataSize='703233', creationTime='1700214876060',
objectID='-9223372036854772991', parentID='-9223372036854775552',
replication='RATIS/THREE', fileChecksum='null}
at
org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
at
org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:721)
at
org.apache.hadoop.ozone.om.request.key.OMKeyCreateRequestWithFSO.validateAndUpdateCache(OMKeyCreateRequestWithFSO.java:163)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
{code}
*(3)* Could this be caused by `OzoneManagerStateMachine#takeSnapshot` updating
TransactionInfoTable to smaller? We found that takeSnapshot was not fast enough
to keep up with transactionId.
{noformat}
2023-10-18 17:16:03,028 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:4, i:204017881)
2023-10-18 17:37:14,031 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:4, i:204417882)
2023-10-18 17:50:11,412 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:4, i:204666693)
2023-10-18 18:09:02,208 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:5, i:205066695)
2023-10-18 18:31:53,119 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:5, i:205466696)
2023-10-18 18:53:19,399 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:5, i:205866697)
2023-10-24 11:40:08,558 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:5, i:206114616)
2023-10-24 11:40:13,779 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Current Snapshot
Index (t:5, i:206114616)
{noformat}
ratisLog:
{code:sh}
// ratis data dir:
ll | wc -l
11869
ll | head
total 194372576
-rw-rw-r-- 1 hadoop hadoop 16776283 Oct 24 12:12 log_206094558-206117379
-rw-rw-r-- 1 hadoop hadoop 16776515 Oct 24 12:12 log_206117380-206139612
-rw-rw-r-- 1 hadoop hadoop 16776192 Oct 24 12:12 log_206139613-206162121
-rw-rw-r-- 1 hadoop hadoop 16776319 Oct 24 12:12 log_206162122-206184686
-rw-rw-r-- 1 hadoop hadoop 16776531 Oct 24 12:12 log_206184687-206207228
-rw-rw-r-- 1 hadoop hadoop 16776976 Oct 24 12:12 log_206207229-206229818
-rw-rw-r-- 1 hadoop hadoop 16775986 Oct 24 12:12 log_206229819-206252483
-rw-rw-r-- 1 hadoop hadoop 16776311 Oct 24 12:12 log_206252484-206275029
-rw-rw-r-- 1 hadoop hadoop 16776682 Oct 24 12:12 log_206275030-206297665
ll | tail
-rw-rw-r-- 1 hadoop hadoop 16776408 Oct 24 12:14 log_405672200-405691320
-rw-rw-r-- 1 hadoop hadoop 16776198 Oct 24 12:14 log_405691321-405710197
-rw-rw-r-- 1 hadoop hadoop 16776874 Oct 24 12:14 log_405710198-405729231
-rw-rw-r-- 1 hadoop hadoop 16777122 Oct 24 12:14 log_405729232-405748582
-rw-rw-r-- 1 hadoop hadoop 16776624 Oct 24 12:14 log_405748583-405767797
-rw-rw-r-- 1 hadoop hadoop 16775784 Oct 24 12:14 log_405767798-405787358
-rw-rw-r-- 1 hadoop hadoop 16776092 Oct 24 12:14 log_405787359-405807241
-rw-rw-r-- 1 hadoop hadoop 14003799 Oct 24 12:12 log_inprogress_405807242
-rw-rw-r-- 1 hadoop hadoop 52 Oct 24 12:12 raft-meta
-rw-rw-r-- 1 hadoop hadoop 151 Oct 24 12:12 raft-meta.conf
{code}
> OM restart failed due to transactionLogIndex smaller than current updateID
> --------------------------------------------------------------------------
>
> Key: HDDS-9342
> URL: https://issues.apache.org/jira/browse/HDDS-9342
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, OM HA
> Affects Versions: 1.3.0
> Reporter: Hongbing Wang
> Priority: Critical
> Attachments: om.shutdown-20230922.log
>
>
> OM restart failed, log as follow:
> create failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863625
> which is not greater than the current value of 2901863627 for
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='user/xxx/platform/xxx',
> dataSize='268435456', creationTime='1695088210914',
> objectID='-9223371293977687808', parentID='0', replication='RATIS/THREE',
> fileChecksum='null}
> at
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:665)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareKeyInfo(OMKeyRequest.java:623)
> at
> org.apache.hadoop.ozone.om.request.file.OMFileCreateRequest.validateAndUpdateCache(OMFileCreateRequest.java:255)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> rename failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863669
> which is not greater than the current value of 3076345041 for
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='checkative/xxx',
> dataSize='23124', creationTime='1695380440059',
> objectID='-9223371249310446848', parentID='0', replication='RATIS/THREE',
> fileChecksum='null}
> at
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:190)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]