[
https://issues.apache.org/jira/browse/HDDS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789422#comment-17789422
]
guangbao zhao commented on HDDS-9342:
-------------------------------------
The reason why problems like this occur is because the key to be operated
during playback is already in rocksdb, and the updateId of the key is larger
than the current transaction Id.
{code:java}
Trying to set updateID to 2901863625 which is not greater than the current
value of 2901863627 {code}
As [~wanghongbing] said above
{code:java}
1: put
2: delete
3: put {code}
This problem will occur when playback starts from 2 next time.
In our logs, we found that takeSnapshot was not executed from 2023-10-18
18:53:19 until "JVM pause detected 131.539s longer than the close-threshold
120s," appeared at 2023-10-24 11:40:08,552. shutting down..."
Then take snapshot is triggered again:
{code:java}
2023-10-24 11:40:13,773 [om2@group-197E298202B9-StateMachineUpdater] INFO
org.apache.ratis.server.impl.StateMachineUpdater:
om2@group-197E298202B9-StateMachineUpdater: Took a snapshot at index 206114616
{code}
And this operation will set the transactionInfoTable in rocksdb to: 206114616
when om exits. Then when om is started again, playback will start from this
position, and the above exception will appear.
So it is more important to find the reason why takeSnapshot is not executed.
According to the above log description, the result of getLastAppliedTermIndex()
is: 206114616
In StateMachineUpdater, shouldTakeSnapshot is used to determine whether to
execute takeSnapshot.
Because I did not find the relevant log of take snapshot in our log, it means
that shouldTakeSnapshot does not meet the true condition after 2023-10-18
18:53:19.
{code:java}
private boolean shouldTakeSnapshot() {
if (state == State.RUNNING &&
server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot()) {
return true;
}
if (autoSnapshotThreshold == null) {
return false;
} else if (shouldStop()) {
return getLastAppliedIndex() - snapshotIndex.get() > 0;
}
return state == State.RUNNING &&
getStateMachineLastAppliedIndex() - snapshotIndex.get() >=
autoSnapshotThreshold;
}long getStateMachineLastAppliedIndex() {
return Optional.ofNullable(stateMachine.getLastAppliedTermIndex())
.map(TermIndex::getIndex)
.orElse(RaftLog.INVALID_LOG_INDEX);
} {code}
Then it may be that the following conditions are not met:
{code:java}
getStateMachineLastAppliedIndex() - snapshotIndex.get() >=
autoSnapshotThreshold {code}
In Ozone, when the transaction in doubleBuffer is written to rocksdb,
updateLastAppliedIndex of ozoneManagerStateMachine is called to update
lastAppliedTermIndex. So is there a problem with the conditions here, causing
lastAppliedTermIndex not to be updated or the update to be very slower, which
does not meet the conditions of takeSnapshot?
So do you have any better troubleshooting ideas?[~Sammi] [~szetszwo]
(cc [~wanghongbing] )
> OM restart failed due to transactionLogIndex smaller than current updateID
> --------------------------------------------------------------------------
>
> Key: HDDS-9342
> URL: https://issues.apache.org/jira/browse/HDDS-9342
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, OM HA
> Affects Versions: 1.3.0
> Reporter: Hongbing Wang
> Priority: Critical
> Attachments: clipboard_image_1700795744614.png,
> om.shutdown-20230922.log
>
>
> OM restart failed, log as follow:
> create failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863625
> which is not greater than the current value of 2901863627 for
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='user/xxx/platform/xxx',
> dataSize='268435456', creationTime='1695088210914',
> objectID='-9223371293977687808', parentID='0', replication='RATIS/THREE',
> fileChecksum='null}
> at
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:665)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareKeyInfo(OMKeyRequest.java:623)
> at
> org.apache.hadoop.ozone.om.request.file.OMFileCreateRequest.validateAndUpdateCache(OMFileCreateRequest.java:255)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> rename failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863669
> which is not greater than the current value of 3076345041 for
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='checkative/xxx',
> dataSize='23124', creationTime='1695380440059',
> objectID='-9223371249310446848', parentID='0', replication='RATIS/THREE',
> fileChecksum='null}
> at
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> at
> org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:190)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]