[jira] [Commented] (HDDS-9342) OM restart failed due to transactionLogIndex smaller than current updateID

guangbao zhao (Jira) Fri, 24 Nov 2023 02:22:05 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789422#comment-17789422
 ]


guangbao zhao commented on HDDS-9342:
-------------------------------------

The reason why problems like this occur is because the key to be operated 
during playback is already in rocksdb, and the updateId of the key is larger 
than the current transaction Id.
{code:java}
Trying to set updateID to 2901863625 which is not greater than the current 
value of 2901863627 {code}
As [~wanghongbing]  said above
{code:java}
1: put
2: delete
3: put {code}
This problem will occur when playback starts from 2 next time.

In our logs, we found that takeSnapshot was not executed from 2023-10-18 
18:53:19 until "JVM pause detected 131.539s longer than the close-threshold 
120s," appeared at 2023-10-24 11:40:08,552. shutting down..."

Then take snapshot is triggered again:
{code:java}
2023-10-24 11:40:13,773 [om2@group-197E298202B9-StateMachineUpdater] INFO 
org.apache.ratis.server.impl.StateMachineUpdater: 
om2@group-197E298202B9-StateMachineUpdater: Took a snapshot at index 206114616 
{code}
And this operation will set the transactionInfoTable in rocksdb to: 206114616 
when om exits. Then when om is started again, playback will start from this 
position, and the above exception will appear.

So it is more important to find the reason why takeSnapshot is not executed.

According to the above log description, the result of getLastAppliedTermIndex() 
is: 206114616

In StateMachineUpdater, shouldTakeSnapshot is used to determine whether to 
execute takeSnapshot.

Because I did not find the relevant log of take snapshot in our log, it means 
that shouldTakeSnapshot does not meet the true condition after 2023-10-18 
18:53:19.
{code:java}
private boolean shouldTakeSnapshot() {
  if (state == State.RUNNING && 
server.getSnapshotRequestHandler().shouldTriggerTakingSnapshot()) {
    return true;
  }
  if (autoSnapshotThreshold == null) {
    return false;
  } else if (shouldStop()) {
    return getLastAppliedIndex() - snapshotIndex.get() > 0;
  }
  return state == State.RUNNING &&
      getStateMachineLastAppliedIndex() - snapshotIndex.get() >= 
autoSnapshotThreshold;
}long getStateMachineLastAppliedIndex() {
  return Optional.ofNullable(stateMachine.getLastAppliedTermIndex())
      .map(TermIndex::getIndex)
      .orElse(RaftLog.INVALID_LOG_INDEX);
} {code}
Then it may be that the following conditions are not met:
{code:java}
getStateMachineLastAppliedIndex() - snapshotIndex.get() >= 
autoSnapshotThreshold {code}
In Ozone, when the transaction in doubleBuffer is written to rocksdb, 
updateLastAppliedIndex of ozoneManagerStateMachine is called to update 
lastAppliedTermIndex. So is there a problem with the conditions here, causing 
lastAppliedTermIndex not to be updated or the update to be very slower, which 
does not meet the conditions of takeSnapshot？

So do you have any better troubleshooting ideas?[~Sammi] [~szetszwo] 

（cc [~wanghongbing] ）

> OM restart failed due to transactionLogIndex smaller than current updateID
> --------------------------------------------------------------------------
>
>                 Key: HDDS-9342
>                 URL: https://issues.apache.org/jira/browse/HDDS-9342
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, OM HA
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Priority: Critical
>         Attachments: clipboard_image_1700795744614.png, 
> om.shutdown-20230922.log
>
>
> OM restart failed, log as follow:
> create failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863625 
> which is not greater than the current value of 2901863627 for 
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='user/xxx/platform/xxx', 
> dataSize='268435456', creationTime='1695088210914', 
> objectID='-9223371293977687808', parentID='0', replication='RATIS/THREE', 
> fileChecksum='null}
>         at 
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:665)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareKeyInfo(OMKeyRequest.java:623)
>         at 
> org.apache.hadoop.ozone.om.request.file.OMFileCreateRequest.validateAndUpdateCache(OMFileCreateRequest.java:255)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> rename failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863669 
> which is not greater than the current value of 3076345041 for 
> OMKeyInfo{volume='vol-xxx', bucket='xxx', key='checkative/xxx', 
> dataSize='23124', creationTime='1695380440059', 
> objectID='-9223371249310446848', parentID='0', replication='RATIS/THREE', 
> fileChecksum='null}
>         at 
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
>         at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:190)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
>         at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDDS-9342) OM restart failed due to transactionLogIndex smaller than current updateID

Reply via email to