[jira] [Commented] (HADOOP-19281) MetricsSystemImpl should not print INFO message in CLI
[ https://issues.apache.org/jira/browse/HADOOP-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883313#comment-17883313 ] Tsz-wo Sze commented on HADOOP-19281: - [~sarvekshayr], thanks for checking. Could you try "hadoop fs" such as the command below? {code} hadoop fs -Dfs.s3a.bucket.probe=0 -Dfs.s3a.change.detection.version.required=false -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.path.style.access=true -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls -R s3a://bucket1/ {code} You are right that this is a problem in Hadoop but not Ozone. Moved this to Hadoop Common. > MetricsSystemImpl should not print INFO message in CLI > -- > > Key: HADOOP-19281 > URL: https://issues.apache.org/jira/browse/HADOOP-19281 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Tsz-wo Sze >Priority: Major > Labels: newbie > > Below is an example: > {code} > # hadoop fs -Dfs.s3a.bucket.probe=0 > -Dfs.s3a.change.detection.version.required=false > -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 > -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 > -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.path.style.access=true > -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls -R s3a://bucket1/ > 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried > hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot > period at 10 second(s). > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > started > 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). 
Setting to 15,000 > ms instead > 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an > instance of MultipartS3AsyncClient, and thus multipart download feature is > not enabled. To benefit from all features, consider using > S3AsyncClient.crtBuilder().build() instead. > drwxrwxrwx - root root 0 2024-09-17 10:47 s3a://bucket1/dir1 > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > stopped. > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > shutdown complete. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
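Until the metrics loggers are quieted by default, a client-side logging override is the usual workaround. A possible log4j.properties sketch (logger names taken from the output above; verify them against your Hadoop version):

```properties
# Raise the metrics loggers above INFO so "hadoop fs" output stays clean.
log4j.logger.org.apache.hadoop.metrics2.impl.MetricsSystemImpl=WARN
# The "Cannot locate configuration" WARN can likewise be hidden if desired.
log4j.logger.org.apache.hadoop.metrics2.impl.MetricsConfig=ERROR
```

For a single invocation, `HADOOP_ROOT_LOGGER=WARN,console hadoop fs ...` lowers all client logging without editing any file.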
[jira] [Assigned] (HADOOP-19281) MetricsSystemImpl should not print INFO message in CLI
[ https://issues.apache.org/jira/browse/HADOOP-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned HADOOP-19281: --- Component/s: metrics (was: metrics) Key: HADOOP-19281 (was: HDDS-11466) Workflow: no-reopen-closed, patch-avail (was: patch-available, re-open possible) Assignee: (was: Sarveksha Yeshavantha Raju) Project: Hadoop Common (was: Apache Ozone) > MetricsSystemImpl should not print INFO message in CLI > -- > > Key: HADOOP-19281 > URL: https://issues.apache.org/jira/browse/HADOOP-19281 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Tsz-wo Sze >Priority: Major > Labels: newbie > > Below is an example: > {code} > # hadoop fs -Dfs.s3a.bucket.probe=0 > -Dfs.s3a.change.detection.version.required=false > -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 > -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 > -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.path.style.access=true > -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls -R s3a://bucket1/ > 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried > hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot > period at 10 second(s). > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > started > 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an > instance of MultipartS3AsyncClient, and thus multipart download feature is > not enabled. To benefit from all features, consider using > S3AsyncClient.crtBuilder().build() instead. > drwxrwxrwx - root root 0 2024-09-17 10:47 s3a://bucket1/dir1 > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... 
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > stopped. > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > shutdown complete. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Created] (RATIS-2160) MetricRegistriesLoader should not print INFO message in CLI
Tsz-wo Sze created RATIS-2160: - Summary: MetricRegistriesLoader should not print INFO message in CLI Key: RATIS-2160 URL: https://issues.apache.org/jira/browse/RATIS-2160 Project: Ratis Issue Type: Bug Components: shell Reporter: Tsz-wo Sze MetricRegistriesLoader uses the MetricRegistries logger to print the following INFO message. {code} 2024-09-19 10:56:43 INFO MetricRegistries:64 - Loaded MetricRegistries class org.apache.ratis.metrics.impl.MetricRegistriesImpl {code} Note that "MetricRegistries:64" is very misleading since 64 is actually line 64 of MetricRegistriesLoader, not MetricRegistries. {code} //line 64 in MetricRegistriesLoader LOG.info("Loaded MetricRegistries " + impl.getClass()); {code} When there are multiple implementations, it prints the following WARN message instead. {code} [main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple MetricRegistries implementations: class org.apache.ratis.metrics.impl.MetricRegistriesImpl, class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@4d1b0d2a {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
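The misleading location can be reproduced with any logging framework: the logger name comes from the class passed to the logger factory, while any line number in the layout comes from the call site. A minimal `java.util.logging` sketch (the two classes are redefined here purely for illustration):

```java
import java.util.logging.Logger;

// The logger is named after MetricRegistries...
class MetricRegistries {
    static final Logger LOG = Logger.getLogger(MetricRegistries.class.getName());
}

// ...but the log statement lives in MetricRegistriesLoader, so a layout that
// prints "loggerName:lineNumber" shows "MetricRegistries" next to a line
// number that actually belongs to MetricRegistriesLoader.
class MetricRegistriesLoader {
    static void load() {
        MetricRegistries.LOG.info("Loaded MetricRegistries " + MetricRegistries.class);
    }
}
```

One fix is simply to obtain the logger via `Logger.getLogger(MetricRegistriesLoader.class.getName())` in the class that emits the message.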
[jira] [Commented] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
[ https://issues.apache.org/jira/browse/HDDS-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883060#comment-17883060 ] Tsz-wo Sze commented on HDDS-11470: --- Below is an example that om2 replied "Completed INSTALL_SNAPSHOT" but it actually had failed to move downloaded DB checkpoint due to a UnixException "Invalid cross-device link". {code} 2024-09-18 09:28:20,365 INFO [grpc-default-executor-1]-org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastReply: null 2024-09-18 09:28:20,365 INFO [pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: metadataManager is stopped. Spend 7 ms. 2024-09-18 09:28:20,367 ERROR [pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Failed to move downloaded DB checkpoint /var/lib/hadoop-ozone/om/ozone-metaot/om.db.candidate to metadata directory /ozone/hadoop-ozone/om/data/om.db. Exception: {}. Resetting to original DB. java.nio.file.FileSystemException: /ozone/hadoop-ozone/om/data/om.db/44.sst -> /var/lib/hadoop-ozone/om/ozone-metadata/snapshot/om.db.candidate/44.sst: Invalid cross-device link at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) at java.base/sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:477) at java.base/java.nio.file.Files.createLink(Files.java:1101) at org.apache.hadoop.ozone.om.snapshot.OmSnapshotUtils.linkFiles(OmSnapshotUtils.java:169) at org.apache.hadoop.ozone.om.OzoneManager.moveCheckpointFiles(OzoneManager.java:3884) at org.apache.hadoop.ozone.om.OzoneManager.replaceOMDBWithCheckpoint(OzoneManager.java:3864) at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3738) at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3673) at org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3650) at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$5(OzoneManagerStateMachine.java:505) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) {code} > OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed > > > Key: HDDS-11470 > URL: https://issues.apache.org/jira/browse/HDDS-11470 > Project: Apache Ozone > Issue Type: Bug > Components: OM HA >Reporter: Tsz-wo Sze >Priority: Major > > When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply > "Completed INSTALL_SNAPSHOT". > In the code below, when there is an exception, it just print an error message > and continue to reply "Completed INSTALL_SNAPSHOT". > {code} > //OzoneManager.installCheckpoint > try { > time = Time.monotonicNow(); > dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, > oldDBLocation, checkpointLocation); > term = checkpointTrxnInfo.getTerm(); > lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex(); > LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " + > "index: {}, time: {} ms", leaderId, term, lastAppliedIndex, > Time.monotonicNow() - time); > } catch (Exception e) { > LOG.error("Failed to install Snapshot from {} as OM failed to > replace" + > " DB with downloaded checkpoint. Reloading old OM state.", > leaderId, e); > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
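The root cause here is that `Files.createLink` creates a hard link, and hard links cannot cross filesystems: linking `/var/lib/...` into `/ozone/...` fails with EXDEV ("Invalid cross-device link"). Below is a hedged sketch of the usual mitigation, a hypothetical helper (not the actual Ozone fix) that falls back to a real copy when linking fails:

```java
import java.io.IOException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LinkOrCopy {
    // Try a cheap hard link first; if the two paths live on different
    // filesystems (or links are unsupported), fall back to copying bytes.
    static void linkOrCopy(Path link, Path existing) throws IOException {
        try {
            Files.createLink(link, existing);
        } catch (UnsupportedOperationException | FileSystemException e) {
            Files.copy(existing, link, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The trade-off: the link is O(1) and shares storage, while the copy costs time and disk, so callers that care about space (e.g. checkpoint installation) may want to log which path was taken.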
[jira] [Updated] (RATIS-2145) Follower hangs until the next trigger to take a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2145: -- Fix Version/s: (was: 3.2.0) > Follower hangs until the next trigger to take a snapshot > - > > Key: RATIS-2145 > URL: https://issues.apache.org/jira/browse/RATIS-2145 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major > Fix For: 3.1.1 > > Time Spent: 20m > Remaining Estimate: 0h > > We discovered a problem when writing tests with high concurrency. It often > happens that a follower is running well and then hangs until the next > takeSnapshot is triggered. > The following is the relevant log. > follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and > 2024/08/22 20:21:57,058, no other logs appeared in the follower, but a > follower election was not triggered, indicating that the heartbeats sent by > the leader to the follower were succeeding) > {code:java} > 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO > org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly > 22436696498 -> 22441096501 > 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: > node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment > /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615 > 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax > old=22432683959, new=22437078979, updated? 
true > 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO > com.xxx.RaftJournalManager: Received install snapshot notification from > MetaStore leader: node3 with term index: (t:192, i:22441477801) > 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO > com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from > Leader MetaStore node3. Checkpoint address: leader:8170 > 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, > i:22441477801) > 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastReply: null > 2024/08/22 20:21:57,067 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed > appendEntries as snapshot (22441477801) installation is in progress > 2024/08/22 20:21:57,068 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: > inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code} > leader: > {code:java} > 2024/08/22 20:18:16,958 [timer5] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, > i:22441098598)...(t:192, i:22441098622) > 2024/08/22 20:18:16,964 [timer3] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, > i:22441098624) > 2024/08/22 20:18:16,964 [timer6] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, > i:22441098625) > 2024/08/22 20:18:16,964 [timer7] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, > i:22441098623) > 2024/08/22 20:18:16,965 [timer3] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192, > i:22441098627) > 2024/08/22 20:18:16,965 [timer7] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=168
[jira] [Updated] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes
[ https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2146: -- Fix Version/s: (was: 3.2.0) > Fixed possible issues caused by concurrent deletion and election when member > changes > > > Key: RATIS-2146 > URL: https://issues.apache.org/jira/browse/RATIS-2146 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-08-28-14-53-23-259.png, > image-2024-08-28-14-53-27-637.png > > Time Spent: 1h > Remaining Estimate: 0h > > During this process, we encountered some concurrency issues: > * After the member change is complete, node D will no longer be a member of > this consensus group. It will attempt to initiate an election but receive a > NOT_IN_CONF response, after which it will close itself. > * During the removal of member D, it will also close itself first, and then > proceed to delete the file directory. > These two CLOSE operations may occur concurrently, which could result in the > directory being deleted while the StateMachineUpdater thread has not yet > closed, ultimately leading to unexpected errors. > !image-2024-08-28-14-53-23-259.png! > !image-2024-08-28-14-53-27-637.png! > I believe there are two possible solutions for this issue: > * Add concurrency control to the close function, such as adding the > synchronized keyword to the function. > * Add some checks before deleting the directory to ensure that the callback > functions in the close process have already been executed before the > directory is deleted. > What's your opinion? [~szetszwo] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2153) ratis-version.properties missing from src bundle
[ https://issues.apache.org/jira/browse/RATIS-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2153: -- Fix Version/s: (was: 3.2.0) > ratis-version.properties missing from src bundle > > > Key: RATIS-2153 > URL: https://issues.apache.org/jira/browse/RATIS-2153 > Project: Ratis > Issue Type: Bug > Components: build >Affects Versions: 3.1.1, 3.2.0 >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Blocker > Fix For: 3.1.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > RATIS-1840 added {{src/main/resources/ratis-version.properties}} in root > module. This file is missing from the {{src}} assembly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet
[ https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2149: -- Fix Version/s: 3.1.1 (was: 3.2.0) > Do not perform leader election if the current RaftServer has not started yet > > > Key: RATIS-2149 > URL: https://issues.apache.org/jira/browse/RATIS-2149 > Project: Ratis > Issue Type: Improvement > Components: election >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-09-03-17-41-41-872.png, > image-2024-09-03-18-13-50-628.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Sometimes we cannot guarantee that the program will run normally in all > environments, so some robustness enhancement may be necessary. > Before adding members, RaftServer S and the corresponding group will be > created if the group does not exist; we found that the interval between these > two logs is more than one minute. > !image-2024-09-03-17-41-41-872.png! > > Since our RpcTimeout is less than 1 minute, the retryPolicy has already started, > but S's groupId is already in the implMaps of RaftServerProxy, which will > throw AlreadyExistException. When we catch this exception, we assume that the > creation has completed and the member change can be executed. > > However, S is still in the initializing state, so this member change will not be > completed. Finally, we found that S started an election, received a > NOT_IN_CONF reply, and then closed itself. > !image-2024-09-03-18-13-50-628.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2148) Snapshot transfer may cause followers to trigger reloadStateMachine incorrectly
[ https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2148: -- Fix Version/s: 3.1.1 (was: 3.2.0) > Snapshot transfer may cause followers to trigger reloadStateMachine > incorrectly > --- > > Key: RATIS-2148 > URL: https://issues.apache.org/jira/browse/RATIS-2148 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0 >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-09-03-14-24-25-652.png, > image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, > image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, > image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Because gRPC streaming snapshot sending sends all requests at once, error > handling is performed only after all requests are sent, and the last snapshot > request is used as a completion flag. This may lead to the last request being > received successfully even though an earlier request has failed. The sender > handles the failure by retransmitting the snapshot, while the receiver > triggers state.reloadStateMachine because it successfully received the last > request, even though the snapshot it received is incomplete. > > An md5 mismatch exception occurred before the last SnapshotRequest was > received > !image-2024-09-03-14-27-39-406.png! > > The last snapshot request arrived, was successfully received, and then > updated the index. > !image-2024-09-03-14-28-31-529.png! > !image-2024-09-03-14-30-02-751.png! > > However, the snapshot reception is incomplete and triggers the > reloadStateMachine. > !image-2024-09-03-14-33-49-573.png! > > I suggest using a flag to identify whether the entire snapshot request is > abnormal. > If an exception occurs, the subsequent content of the request will not be > processed. > Or the sender will wait for the receiver's reply. 
If there is a release > error, resend it. > > Finally, the current error retry level is the entire snapshot directory > rather than a single chunk, which will cause a large number of snapshot files > to be sent repeatedly, which can be optimized later -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2154) The old leader may send appendEntries after term changed
[ https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2154: -- Fix Version/s: 3.1.1 (was: 3.2.0) > The old leader may send appendEntries after term changed > > > Key: RATIS-2154 > URL: https://issues.apache.org/jira/browse/RATIS-2154 > Project: Ratis > Issue Type: Wish > Components: Leader >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-09-12-09-43-30-670.png > > Time Spent: 20m > Remaining Estimate: 0h > > The leader will become a follower after receiving a higher term, but during > this process, the old leader may be appending LogEntry, and the error log > will be printed until LogAppenderDaemon is closed. > !image-2024-09-12-09-43-30-670.png! > > I think we can put state.updateCurrentTerm (newTerm) later. Close LeaderState > first before updating the term, and other operations remain unchanged. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2150) No need for manual assembly:single execution when mvn deploy
[ https://issues.apache.org/jira/browse/RATIS-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2150: -- Fix Version/s: 3.1.1 > No need for manual assembly:single execution when mvn deploy > > > Key: RATIS-2150 > URL: https://issues.apache.org/jira/browse/RATIS-2150 > Project: Ratis > Issue Type: Improvement > Components: build >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Fix For: 3.1.1 > > Time Spent: 40m > Remaining Estimate: 0h > > [RATIS-2117|https://issues.apache.org/jira/browse/RATIS-2117] missed the mvn > deploy command update, which will be done in this issue -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2152) GrpcLogAppender gets stuck while sending an installSnapshot notification request
[ https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2152: -- Fix Version/s: 3.1.1 (was: 3.2.0) > GrpcLogAppender gets stuck while sending an installSnapshot notification request > > > Key: RATIS-2152 > URL: https://issues.apache.org/jira/browse/RATIS-2152 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Chung En Lee >Assignee: Chung En Lee >Priority: Major > Fix For: 3.1.1 > > Time Spent: 1h > Remaining Estimate: 0h > > In `GrpcLogAppender`, it waits for a signal at the end of > `notifyInstallSnapshot`, as follows: > [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831] > However, checking whether the `InstallSnapshotResponseHandler` is done and > the call to `AwaitForSignal.await()` are not atomic. This creates a potential > race condition where InstallSnapshotResponseHandler.close() could finish > after the check but before the wait, so that `GrpcLogAppender` keeps > waiting even though `InstallSnapshotResponseHandler` has already completed, > leading to a timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)
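The fix described above amounts to making the done-check and the wait atomic by putting both under the same monitor that `close()` signals. A minimal illustrative sketch of the pattern (class and method names are invented, not the actual GrpcLogAppender code):

```java
// Illustrative pattern: the done-check and the wait share one monitor, so
// close() can never complete between the check and the await.
class CloseAwareWaiter {
    private final Object monitor = new Object();
    private boolean closed = false;

    void close() {
        synchronized (monitor) {
            closed = true;
            monitor.notifyAll(); // wake any thread parked in awaitClosed()
        }
    }

    /** Returns true if closed within the timeout, false on timeout. */
    boolean awaitClosed(long timeoutMs) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (monitor) {
            while (!closed) {
                final long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out; caller decides what to do
                }
                monitor.wait(remaining);
            }
            return true;
        }
    }
}
```

With the non-atomic version (check `closed` outside the monitor, then wait), a `close()` landing between the two steps leaves the waiter parked until its timeout, which is exactly the reported symptom.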
[jira] [Resolved] (RATIS-2080) Reuse LeaderElection executor
[ https://issues.apache.org/jira/browse/RATIS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2080. --- Fix Version/s: (was: 3.1.0) Resolution: Won't Do Resolving this as "Won't Do". > Reuse LeaderElection executor > - > > Key: RATIS-2080 > URL: https://issues.apache.org/jira/browse/RATIS-2080 > Project: Ratis > Issue Type: Improvement > Components: election >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > When running TestRaftWithNetty#testWithLoad with 5 servers, there were 110 > leader election threads as shown below. We should reuse the vote executor. > {code} > $cat threaddump.txt | grep "C-LeaderElection" | wc > 110 1320 25621 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
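Although the ticket was resolved as "Won't Do", the idea was to share one executor across election rounds instead of creating a fresh thread set per round. An illustrative sketch of such sharing (all names are invented, not Ratis code):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: one shared, cached pool for all vote RPCs of a server.
// Idle threads are reused across election rounds instead of piling up.
class SharedVoteExecutor {
    private final ExecutorService pool =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "LeaderElection-vote");
            t.setDaemon(true);
            return t;
        });

    <T> List<Future<T>> submitVotes(List<Callable<T>> votes) {
        return votes.stream()
            .map(pool::submit)
            .collect(java.util.stream.Collectors.toList());
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A cached pool keeps idle worker threads for reuse, so repeated elections no longer leave hundreds of dead per-round threads behind in a thread dump.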
[jira] [Resolved] (RATIS-2147) MD5 mismatch when accept snapshot
[ https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2147. --- Fix Version/s: 3.1.1 Resolution: Fixed The pull request was merged. Thanks, [~tohsakarin__]! > MD5 mismatch when accept snapshot > - > > Key: RATIS-2147 > URL: https://issues.apache.org/jira/browse/RATIS-2147 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0, 3.2.0 >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-09-03-10-35-08-315.png, > image-2024-09-03-10-35-28-617.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > We encountered an MD5 mismatch issue in IoTDB, and after multiple > investigations, we found that the digester was contaminated > > We have checked that it is not a network and disk problem > > In implementation, the received snapshot will be written to a temporary file > first. If there is an md5 mismatch, we will read the data from this temporary > file and use a new digest to calculate md5, but the result of this > calculation is the same as the md5 hash value sent > !image-2024-09-03-10-35-28-617.png! > > !image-2024-09-03-10-35-08-315.png! > > > Use the saved corrupted file name to locate the relevant log, here to > tlog.txt.snapshot.snapshot.as an example corrupt20240831-094107 _735 > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA! > Before encountering corrupt, the sender sent several consecutive snapshot > installation requests to the receiver. > > The receiver successfully received some requests, and then encountered a > request for corrupt, and began printing "recompute again" to start > recalculating. 
> > After execution, the ERROR log of the rename is printed, and the data > is read back from the file and compared with the received chunk data. > > If a byte did not match, the corresponding information would be printed; > since no such log was printed, the content written to disk is the same as > the content sent > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA! > This makes the problem very clear: there is a problem with the MD5 > calculation class, for the following reasons: > > If a byte in the middle of the data part were corrupted due to network > issues, the calculated result and the hash sent would have to differ > > If there were a problem with the part that stores the hash value, the final > calculation result would also differ. > > I suggest creating a new digest every time the follower receives a snapshot, > so as to avoid pollution problems. Under normal network and disk conditions, > corruption will not occur -- This message was sent by Atlassian Jira (v8.20.10#820010)
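The suggestion in the last paragraph works because a `MessageDigest` carries internal state: leftover `update()` bytes from an earlier, failed transfer change the next checksum. A minimal sketch of the per-snapshot fresh-digest approach (the helper name is invented):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SnapshotMd5 {
    // A fresh digest per snapshot: no state can leak from an earlier,
    // aborted transfer into this checksum.
    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }
}
```

Equivalently, a reused instance must be `reset()` on every error path; creating a new digest per snapshot is simply harder to get wrong.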
[jira] [Updated] (RATIS-2147) MD5 mismatch when accept snapshot
[ https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2147: -- Affects Version/s: (was: 3.1.0) (was: 3.2.0) > MD5 mismatch when accept snapshot > - > > Key: RATIS-2147 > URL: https://issues.apache.org/jira/browse/RATIS-2147 > Project: Ratis > Issue Type: Bug > Components: snapshot >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Attachments: image-2024-09-03-10-35-08-315.png, > image-2024-09-03-10-35-28-617.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > We encountered an MD5 mismatch issue in IoTDB, and after multiple > investigations, we found that the digester was contaminated > > We have checked that it is not a network and disk problem > > In implementation, the received snapshot will be written to a temporary file > first. If there is an md5 mismatch, we will read the data from this temporary > file and use a new digest to calculate md5, but the result of this > calculation is the same as the md5 hash value sent > !image-2024-09-03-10-35-28-617.png! > > !image-2024-09-03-10-35-08-315.png! > > > Use the saved corrupted file name to locate the relevant log, here to > tlog.txt.snapshot.snapshot.as an example corrupt20240831-094107 _735 > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA! > Before encountering corrupt, the sender sent several consecutive snapshot > installation requests to the receiver. > > The receiver successfully received some requests, and then encountered a > request for corrupt, and began printing "recompute again" to start > recalculating. > > After execution, the ERROR log of the rename will be printed, and the data > will be read from the file and compared with the received chunk data. 
> > If a byte does not match, the corresponding information will be printed, but > no log information will be printed, which means that the content written to > the disk is the same as the content sent > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA! > This makes the problem very clear. There is a problem with the MD5 > calculation class, and the reasons are as follows: > > If a byte in the middle of the data part is incorrect due to network > reasons, the calculated result and the hash sent must be different > > If there is a problem with the part that stores the hash value, the final > calculation result will also be different. > > I suggest creating a new digest every time follower receive a snapshot, so as > to avoid pollution problems. Under normal network and disk conditions, > Corrupt will not occur -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2157) Enhance make_rc.sh for non-first rc at release time
[ https://issues.apache.org/jira/browse/RATIS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2157: -- Component/s: build Fix Version/s: 3.1.1 (was: 3.2.0) > Enhance make_rc.sh for non-first rc at release time > --- > > Key: RATIS-2157 > URL: https://issues.apache.org/jira/browse/RATIS-2157 > Project: Ratis > Issue Type: Improvement > Components: build >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Fix For: 3.1.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java} > git commit -a -m "Change version for the version $RATISVERSION $RC" > {code} > may fail when sending a subsequent RC for the current version because the mvn > version command has already been executed. Therefore, there are no changes to > commit. So we need to add the --allow-empty flag. > BTW, it may be better to remove -X for publish-mvn as it prints a lot of logs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2155) Add a builder for RatisShell
[ https://issues.apache.org/jira/browse/RATIS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2155: -- Fix Version/s: 3.1.1 (was: 3.2.0) > Add a builder for RatisShell > > > Key: RATIS-2155 > URL: https://issues.apache.org/jira/browse/RATIS-2155 > Project: Ratis > Issue Type: New Feature > Components: shell >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Fix For: 3.1.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, RatisShell is executed via CLI. It will use the default > RaftProperties and a null Parameters to build a RaftClient. There is no way > to pass TlsConf, as a result, RatisShell cannot access secure clusters. > This JIRA is to add a builder in order to pass RaftProperties and Parameters. -- This message was sent by Atlassian Jira (v8.20.10#820010)
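A builder along the lines proposed above could look like the following sketch. All names here are purely illustrative (`Shell`, `Properties`, `Parameters` are stand-ins, not the Ratis types or the committed API); the point is only the pattern: callers inject configuration and TLS parameters before the shell constructs its client, instead of being stuck with CLI defaults.

```java
public class ShellBuilderDemo {
    // Stand-ins for RaftProperties / Parameters; not the Ratis classes.
    static class Properties { }
    static class Parameters { }

    static class Shell {
        final Properties properties;
        final Parameters parameters;   // e.g. would carry a TlsConf

        private Shell(Properties properties, Parameters parameters) {
            this.properties = properties;
            this.parameters = parameters;
        }

        static Builder newBuilder() { return new Builder(); }

        static class Builder {
            private Properties properties = new Properties(); // default, as the CLI uses today
            private Parameters parameters = null;             // null matches current CLI behavior

            Builder setProperties(Properties properties) { this.properties = properties; return this; }
            Builder setParameters(Parameters parameters) { this.parameters = parameters; return this; }
            Shell build() { return new Shell(properties, parameters); }
        }
    }
}
```

With such a builder, a caller on a secure cluster could pass non-null parameters, while `newBuilder().build()` preserves the existing default behavior.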
[jira] [Resolved] (RATIS-2158) Let the snapshot sender and receiver use a new digester each time
[ https://issues.apache.org/jira/browse/RATIS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2158. --- Fix Version/s: 3.1.1 Resolution: Fixed The pull request is now merged. Thanks, [~tohsakarin__]! > Let the snapshot sender and receiver use a new digester each time > -- > > Key: RATIS-2158 > URL: https://issues.apache.org/jira/browse/RATIS-2158 > Project: Ratis > Issue Type: Wish > Components: server >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.1.1 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > This is a follow-up improvement to issue RATIS-2147 MD5 mismatch when accept > snapshot - ASF JIRA (apache.org). > The pr of 2147: [RATIS-2147. Md5 mismatch when snapshot install by > 133tosakarin · Pull Request #1142 · apache/ratis > (github.com)|https://github.com/apache/ratis/pull/1142] > > Since snapshot files are not sent frequently and there is not much > performance loss when using a new digester each time, in order to be more > secure, the snapshot sender and receiver should use a new digester each time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
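The rationale above — a fresh digester per snapshot means leftover state from an earlier (possibly failed) transfer can never contaminate a later checksum — can be illustrated with the JDK's own `java.security.MessageDigest`. This is a minimal sketch, not Ratis code; `md5Of` is an invented helper name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class FreshDigestDemo {
    // Illustrative helper: hash one snapshot with a brand-new digester each
    // time, so no state from a previous transfer can leak into the result.
    static byte[] md5Of(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e); // never on standard JREs
        }
    }

    public static void main(String[] args) {
        byte[] snapshot = "snapshot-bytes".getBytes(StandardCharsets.UTF_8);

        // A reused digester that already consumed bytes from an earlier
        // transfer yields a different ("contaminated") hash for the same input.
        MessageDigest reused = null;
        try {
            reused = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        reused.update("leftover-from-previous-transfer".getBytes(StandardCharsets.UTF_8));
        byte[] contaminated = reused.digest(snapshot);

        byte[] clean = md5Of(snapshot);
        System.out.println(Arrays.equals(contaminated, clean)); // prints false
    }
}
```

(MD5 is fine here because it only guards transfer integrity, not security; `digest()` does reset the digester, but creating a new instance per snapshot removes any dependence on every earlier code path having reset it correctly.)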
[jira] [Assigned] (RATIS-2158) Let the snapshot sender and receiver use a new digester each time
[ https://issues.apache.org/jira/browse/RATIS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2158: - Component/s: server Assignee: yuuka > Let the snapshot sender and receiver use a new digester each time > -- > > Key: RATIS-2158 > URL: https://issues.apache.org/jira/browse/RATIS-2158 > Project: Ratis > Issue Type: Wish > Components: server >Reporter: yuuka >Assignee: yuuka >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > This is a follow-up improvement to issue RATIS-2147 MD5 mismatch when accept > snapshot - ASF JIRA (apache.org). > The pr of 2147: [RATIS-2147. Md5 mismatch when snapshot install by > 133tosakarin · Pull Request #1142 · apache/ratis > (github.com)|https://github.com/apache/ratis/pull/1142] > > Since snapshot files are not sent frequently and there is not much > performance loss when using a new digester each time, in order to be more > secure, the snapshot sender and receiver should use a new digester each time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
[ https://issues.apache.org/jira/browse/HDDS-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-11470: -- Description: When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply "Completed INSTALL_SNAPSHOT". In the code below, when there is an exception, it just prints an error message and continues to reply "Completed INSTALL_SNAPSHOT". {code} //OzoneManager.installCheckpoint try { time = Time.monotonicNow(); dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation, checkpointLocation); term = checkpointTrxnInfo.getTerm(); lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex(); LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " + "index: {}, time: {} ms", leaderId, term, lastAppliedIndex, Time.monotonicNow() - time); } catch (Exception e) { LOG.error("Failed to install Snapshot from {} as OM failed to replace" + " DB with downloaded checkpoint. Reloading old OM state.", leaderId, e); } {code} was: When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply "Completed INSTALL_SNAPSHOT". {code} {code} > OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed > > > Key: HDDS-11470 > URL: https://issues.apache.org/jira/browse/HDDS-11470 > Project: Apache Ozone > Issue Type: Bug > Components: OM HA >Reporter: Tsz-wo Sze >Priority: Major > > When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply > "Completed INSTALL_SNAPSHOT". > In the code below, when there is an exception, it just prints an error message > and continues to reply "Completed INSTALL_SNAPSHOT". 
> {code} > //OzoneManager.installCheckpoint > try { > time = Time.monotonicNow(); > dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, > oldDBLocation, checkpointLocation); > term = checkpointTrxnInfo.getTerm(); > lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex(); > LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " + > "index: {}, time: {} ms", leaderId, term, lastAppliedIndex, > Time.monotonicNow() - time); > } catch (Exception e) { > LOG.error("Failed to install Snapshot from {} as OM failed to > replace" + > " DB with downloaded checkpoint. Reloading old OM state.", > leaderId, e); > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
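The quoted catch block logs the failure and lets control fall through, so the caller later reports success anyway. A minimal, hedged sketch of the general fix — with invented names, not the actual OzoneManager code — is to propagate the failure so the caller cannot acknowledge an install that never happened:

```java
public class InstallResultDemo {
    /** Illustrative checked exception; not an Ozone class. */
    static class CheckpointInstallException extends Exception {
        CheckpointInstallException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for the DB-replacement step that may fail. */
    interface CheckpointInstaller {
        void replaceDbWithCheckpoint() throws Exception;
    }

    // Log, then rethrow: the caller now sees the failure and must not reply
    // "Completed INSTALL_SNAPSHOT" for a checkpoint that was never installed.
    static void installCheckpoint(CheckpointInstaller installer)
            throws CheckpointInstallException {
        try {
            installer.replaceDbWithCheckpoint();
        } catch (Exception e) {
            System.err.println("Failed to install snapshot: " + e.getMessage());
            throw new CheckpointInstallException("checkpoint install failed", e);
        }
    }
}
```

Whether the real fix rethrows, returns a status, or replies with a distinct failure code is a design choice for the patch; the invariant is that the error path and the success reply must be mutually exclusive.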
[jira] [Created] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
Tsz-wo Sze created HDDS-11470: - Summary: OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed Key: HDDS-11470 URL: https://issues.apache.org/jira/browse/HDDS-11470 Project: Apache Ozone Issue Type: Bug Components: OM HA Reporter: Tsz-wo Sze When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply "Completed INSTALL_SNAPSHOT". {code} {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Commented] (HDDS-11466) MetricsSystemImpl should not print INFO message in CLI
[ https://issues.apache.org/jira/browse/HDDS-11466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882737#comment-17882737 ] Tsz-wo Sze commented on HDDS-11466: --- We may simply change the INFO messages to DEBUG. bq. 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties For the WARN message above, is it really needed? For CLI, we should minimize the unrelated messages. > MetricsSystemImpl should not print INFO message in CLI > -- > > Key: HDDS-11466 > URL: https://issues.apache.org/jira/browse/HDDS-11466 > Project: Apache Ozone > Issue Type: Improvement > Components: metrics >Reporter: Tsz-wo Sze >Assignee: Sarveksha Yeshavantha Raju >Priority: Major > Labels: newbie > > Below is an example: > {code} > # hadoop fs -Dfs.s3a.bucket.probe=0 > -Dfs.s3a.change.detection.version.required=false > -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 > -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 > -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.path.style.access=true > -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls -R s3a://bucket1/ > 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried > hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot > period at 10 second(s). > 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > started > 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an > instance of MultipartS3AsyncClient, and thus multipart download feature is > not enabled. To benefit from all features, consider using > S3AsyncClient.crtBuilder().build() instead. 
> drwxrwxrwx - root root 0 2024-09-17 10:47 s3a://bucket1/dir1 > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > stopped. > 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system > shutdown complete. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
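Until the log levels are changed in code, CLI users can suppress the noise via logging configuration. For a log4j 1.x setup (what Hadoop's CLI typically ships with), something along these lines in `log4j.properties` should work — the logger name is assumed from the package shown in the output above, so adjust if your build differs:

```properties
# Raise the threshold for the metrics2 implementation package so that
# "metrics system started/stopped" INFO lines stay out of CLI output.
log4j.logger.org.apache.hadoop.metrics2=WARN
```

This is a workaround for operators; the issue itself asks for the defaults to be quiet without any such configuration.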
[jira] [Resolved] (RATIS-2155) Add a builder for RatisShell
[ https://issues.apache.org/jira/browse/RATIS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2155. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. > Add a builder for RatisShell > > > Key: RATIS-2155 > URL: https://issues.apache.org/jira/browse/RATIS-2155 > Project: Ratis > Issue Type: New Feature > Components: shell >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Fix For: 3.2.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, RatisShell is executed via CLI. It will use the default > RaftProperties and a null Parameters to build a RaftClient. There is no way > to pass TlsConf, as a result, RatisShell cannot access secure clusters. > This JIRA is to add a builder in order to pass RaftProperties and Parameters. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HDDS-11466) MetricsSystemImpl should not print INFO message in CLI
Tsz-wo Sze created HDDS-11466: - Summary: MetricsSystemImpl should not print INFO message in CLI Key: HDDS-11466 URL: https://issues.apache.org/jira/browse/HDDS-11466 Project: Apache Ozone Issue Type: Improvement Components: metrics Reporter: Tsz-wo Sze Below is an example: {code} # hadoop fs -Dfs.s3a.bucket.probe=0 -Dfs.s3a.change.detection.version.required=false -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 -Dfs.s3a.endpoint=http://some.site:9878 -Dfs.s3a.path.style.access=true -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -ls -R s3a://bucket1/ 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 ms instead 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an instance of MultipartS3AsyncClient, and thus multipart download feature is not enabled. To benefit from all features, consider using S3AsyncClient.crtBuilder().build() instead. drwxrwxrwx - root root 0 2024-09-17 10:47 s3a://bucket1/dir1 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system... 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped. 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Commented] (RATIS-2156) Notify follower slowness based on the log index
[ https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882101#comment-17882101 ] Tsz-wo Sze commented on RATIS-2156: --- Agree. The "StatusRuntimeException: CANCELLED: RST_STREAM closed stream. HTTP/2 error code: CANCEL" message seems to be caused by RATIS-2135. > Notify follower slowness based on the log index > --- > > Key: RATIS-2156 > URL: https://issues.apache.org/jira/browse/RATIS-2156 > Project: Ratis > Issue Type: Improvement > Components: Leader >Reporter: Ivan Andika >Assignee: Ivan Andika >Priority: Major > Attachments: image-2024-09-13-18-54-04-203.png > > > Currently, StateMachine.LeaderEventApi#notifyFollowerSlowness is based on > raft.server.rpc.slowness.timeout. We saw cases where the RPC RTT between the > leader and follower does not exceed the timeout, yet the difference between > the leader's and the follower's log indices keeps increasing, i.e. the slow > follower cannot catch up. > In Ozone, this causes most watch requests with ALL_COMMITTED replication to > timeout, causing increased latency of writes. It is better to close the > pipeline if the slow follower cannot catch up. > !image-2024-09-13-18-54-04-203.png|width=1408,height=244! -- This message was sent by Atlassian Jira (v8.20.10#820010)
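The index-based check the issue proposes could be as simple as the sketch below. The names and the gap threshold are assumptions for illustration, not the Ratis API; the point is that slowness is judged by replication lag, independent of RPC round-trip time.

```java
public class IndexGapSlownessDemo {
    // A follower counts as "slow" once its replicated (match) index trails
    // the leader's last log index by more than the allowed gap, no matter
    // how quickly its individual RPCs complete.
    static boolean isFollowerSlow(long leaderLastIndex,
                                  long followerMatchIndex,
                                  long maxAllowedGap) {
        return leaderLastIndex - followerMatchIndex > maxAllowedGap;
    }

    public static void main(String[] args) {
        // Follower at index 100 vs leader at 1000 with a 500-entry budget: slow.
        System.out.println(isFollowerSlow(1000, 100, 500));  // prints true
        // Follower at 900 is within the budget: healthy.
        System.out.println(isFollowerSlow(1000, 900, 500));  // prints false
    }
}
```

Such a check would complement, rather than replace, the existing RPC-timeout-based notification.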
[jira] [Created] (RATIS-2159) TestRaftWithSimulatedRpc could "fail to retain".
Tsz-wo Sze created RATIS-2159: - Summary: TestRaftWithSimulatedRpc could "fail to retain". Key: RATIS-2159 URL: https://issues.apache.org/jira/browse/RATIS-2159 Project: Ratis Issue Type: Bug Reporter: Tsz-wo Sze {code} Error: Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 42.31 s <<< FAILURE! - in org.apache.ratis.server.simulation.TestRaftWithSimulatedRpc Error: org.apache.ratis.server.simulation.TestRaftWithSimulatedRpc.testWithLoad Time elapsed: 8.47 s <<< ERROR! java.lang.IllegalStateException: Failed to retain: object has already been completely released. at org.apache.ratis.util.ReferenceCountedLeakDetector$Impl.retain(ReferenceCountedLeakDetector.java:116) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainLog(SegmentedRaftLog.java:310) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:285) at org.apache.ratis.RaftTestUtil.logEntriesContains(RaftTestUtil.java:187) at org.apache.ratis.RaftTestUtil.logEntriesContains(RaftTestUtil.java:172) at org.apache.ratis.RaftTestUtil.lambda$assertLogEntries$5(RaftTestUtil.java:250) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174) ... at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:593) at org.apache.ratis.RaftTestUtil.assertLogEntries(RaftTestUtil.java:251) at org.apache.ratis.RaftTestUtil.assertLogEntries(RaftTestUtil.java:242) at org.apache.ratis.RaftBasicTests.testWithLoad(RaftBasicTests.java:424) at org.apache.ratis.RaftBasicTests.lambda$testWithLoad$8(RaftBasicTests.java:344) at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:143) at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:121) at org.apache.ratis.RaftBasicTests.testWithLoad(RaftBasicTests.java:344) ... 
{code} See https://github.com/apache/ratis/actions/runs/10865610568/job/30154388685?pr=1150#step:5:741 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (RATIS-2127) TestRetryCacheWithGrpc may fail with object already completely released.
[ https://issues.apache.org/jira/browse/RATIS-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881861#comment-17881861 ] Tsz-wo Sze commented on RATIS-2127: --- TestRaftWithSimulatedRpc could also fail although the exception stack trace is different from here; filed RATIS-2159. > TestRetryCacheWithGrpc may fail with object already completely released. > > > Key: RATIS-2127 > URL: https://issues.apache.org/jira/browse/RATIS-2127 > Project: Ratis > Issue Type: Sub-task > Components: gRPC >Reporter: Tsz-wo Sze >Assignee: Duong >Priority: Blocker > > Found IllegalStateException: Failed to release: object has already been > completely released. > {code} > 2024-06-04 12:00:35,728 > [s0@group-11B7B1EB32F8->s4-GrpcLogAppender-LogAppenderDaemon] WARN > leader.LogAppenderDaemon (LogAppenderDaemon.java:run(89)) - > s0@group-11B7B1EB32F8->s4-GrpcLogAppender-LogAppenderDaemon failed > java.lang.IllegalStateException: Failed to release: object has already been > completely released. > at > org.apache.ratis.util.ReferenceCountedLeakDetector$Impl.release(ReferenceCountedLeakDetector.java:130) > at > org.apache.ratis.util.ReferenceCountedLeakDetector$SimpleTracing.release(ReferenceCountedLeakDetector.java:152) > at > org.apache.ratis.util.ReferenceCountedObject$3.release(ReferenceCountedObject.java:150) > at > org.apache.ratis.util.ReferenceCountedObject$2.release(ReferenceCountedObject.java:122) > at > org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:414) > at > org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:262) > at > org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) > at java.lang.Thread.run(Thread.java:750) > {code} > See [the > logs|https://issues.apache.org/jira/secure/attachment/13069289/TestRetryCacheWithGrpc.tar.gz]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2156) Notify follower slowness based on the log index
[ https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2156: -- Component/s: Leader Thanks for filing the JIRA and working on this! > Notify follower slowness based on the log index > --- > > Key: RATIS-2156 > URL: https://issues.apache.org/jira/browse/RATIS-2156 > Project: Ratis > Issue Type: Improvement > Components: Leader >Reporter: Ivan Andika >Assignee: Ivan Andika >Priority: Major > Attachments: image-2024-09-13-18-54-04-203.png > > > Currently, StateMachine.LeaderEventApi#notifyFollowerSlowness is based on > raft.server.rpc.slowness.timeout. We saw cases where the RPC RTT between the > leader and follower does not exceed the timeout, yet the difference between > the leader's and the follower's log indices keeps increasing, i.e. the slow > follower cannot catch up. > In Ozone, this causes most watch requests with ALL_COMMITTED replication to > timeout, causing increased latency of writes. It is better to close the > pipeline if the slow follower cannot catch up. > !image-2024-09-13-18-54-04-203.png|width=1408,height=244! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (RATIS-2155) Add a builder for RatisShell
Tsz-wo Sze created RATIS-2155: - Summary: Add a builder for RatisShell Key: RATIS-2155 URL: https://issues.apache.org/jira/browse/RATIS-2155 Project: Ratis Issue Type: New Feature Components: shell Reporter: Tsz-wo Sze Assignee: Tsz-wo Sze Currently, RatisShell is executed via CLI. It will use the default RaftProperties and a null Parameters to build a RaftClient. There is no way to pass TlsConf, as a result, RatisShell cannot access secure clusters. This JIRA is to add a builder in order to pass RaftProperties and Parameters. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2152) GrpcLogAppender stucks while sending an installSnapshot notification request
[ https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2152. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~wfps1210]! > GrpcLogAppender stucks while sending an installSnapshot notification request > > > Key: RATIS-2152 > URL: https://issues.apache.org/jira/browse/RATIS-2152 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Chung En Lee >Assignee: Chung En Lee >Priority: Major > Fix For: 3.2.0 > > Time Spent: 1h > Remaining Estimate: 0h > > `GrpcLogAppender` waits for a signal at the end of > `notifyInstallSnapshot`, as follows: > [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831] > However, the check of whether the `InstallSnapshotResponseHandler` is done and > the call to `AwaitForSignal.await()` are not atomic. This creates a potential > race condition where InstallSnapshotResponseHandler.close() could finish > after the check but before the wait, so that `GrpcLogAppender` keeps > waiting even though `InstallSnapshotResponseHandler` has already completed, > leading to a timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)
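The underlying fix is the classic condition-wait idiom: the done-check and the wait must happen under the same lock, so a `close()` that fires between them cannot be missed. A minimal sketch with illustrative names (this is the pattern, not the patched Ratis code):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class CheckThenAwaitDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition done = lock.newCondition();
    private boolean handlerDone = false;

    // Called when the response handler completes; flag update and signal
    // happen under the lock, so no waiter can miss them.
    void close() {
        lock.lock();
        try {
            handlerDone = true;
            done.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // The check and the await share the lock: if close() already ran, the
    // loop condition is false and we return without ever blocking.
    void awaitDone() throws InterruptedException {
        lock.lock();
        try {
            while (!handlerDone) {   // always re-check under the lock
                done.await();
            }
        } finally {
            lock.unlock();
        }
    }
}
```

Checking `handlerDone` outside the lock and then calling `await()` reintroduces exactly the lost-wakeup window the issue describes.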
[jira] [Updated] (RATIS-2152) GrpcLogAppender stucks while sending an installSnapshot notification request
[ https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2152: -- Component/s: gRPC > GrpcLogAppender stucks while sending an installSnapshot notification request > > > Key: RATIS-2152 > URL: https://issues.apache.org/jira/browse/RATIS-2152 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Chung En Lee >Assignee: Chung En Lee >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > `GrpcLogAppender` waits for a signal at the end of > `notifyInstallSnapshot`, as follows: > [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831] > However, the check of whether the `InstallSnapshotResponseHandler` is done and > the call to `AwaitForSignal.await()` are not atomic. This creates a potential > race condition where InstallSnapshotResponseHandler.close() could finish > after the check but before the wait, so that `GrpcLogAppender` keeps > waiting even though `InstallSnapshotResponseHandler` has already completed, > leading to a timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2154) The old leader may send appendEntries after term changed
[ https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2154. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~tohsakarin__]! > The old leader may send appendEntries after term changed > > > Key: RATIS-2154 > URL: https://issues.apache.org/jira/browse/RATIS-2154 > Project: Ratis > Issue Type: Wish > Components: Leader >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-09-12-09-43-30-670.png > > Time Spent: 20m > Remaining Estimate: 0h > > The leader will become a follower after receiving a higher term, but during > this process, the old leader may be appending LogEntry, and the error log > will be printed until LogAppenderDaemon is closed. > !image-2024-09-12-09-43-30-670.png! > > I think we can put state.updateCurrentTerm (newTerm) later. Close LeaderState > first before updating the term, and other operations remain unchanged. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (RATIS-2154) The old leader may send appendEntries after term changed
[ https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2154: - Component/s: Leader Assignee: yuuka Summary: The old leader may send appendEntries after term changed (was: Discussion on follow-up when deleting unnecessary error logs GC severe in JVMPauseMonitor) > The old leader may send appendEntries after term changed > > > Key: RATIS-2154 > URL: https://issues.apache.org/jira/browse/RATIS-2154 > Project: Ratis > Issue Type: Wish > Components: Leader >Reporter: yuuka >Assignee: yuuka >Priority: Major > Attachments: image-2024-09-12-09-43-30-670.png > > Time Spent: 20m > Remaining Estimate: 0h > > The leader will become a follower after receiving a higher term, but during > this process, the old leader may be appending LogEntry, and the error log > will be printed until LogAppenderDaemon is closed. > !image-2024-09-12-09-43-30-670.png! > > I think we can put state.updateCurrentTerm (newTerm) later. Close LeaderState > first before updating the term, and other operations remain unchanged. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet
[ https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2149. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~tohsakarin__]! > Do not perform leader election if the current RaftServer has not started yet > > > Key: RATIS-2149 > URL: https://issues.apache.org/jira/browse/RATIS-2149 > Project: Ratis > Issue Type: Improvement > Components: election >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-09-03-17-41-41-872.png, > image-2024-09-03-18-13-50-628.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Sometimes we cannot guarantee that the program will run normally in various > environments, and appropriate robustness enhancement may be necessary. > Before adding members, RaftServer S and the corresponding group will be > created if the group does not exist. We found that the interval between these > two logs is more than one minute. > !image-2024-09-03-17-41-41-872.png! > > Since our RpcTimeout is smaller than 1 minute, the retry policy has already > kicked in, but S's groupId is already in the implMaps of RaftServerProxy, > which throws AlreadyExistException. When we catch this exception, we assume > that the creation has completed and the member change can be executed. > > However, S is still in the initializing state, and this member change will not > be completed. Finally, we found that S started an election, received a > NOT_IN_CONF reply, and then S was closed. > !image-2024-09-03-18-13-50-628.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2148) Snapshot transfer may cause followers to trigger reloadStateMachine incorrectly
[ https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2148. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~tohsakarin__] ! > Snapshot transfer may cause followers to trigger reloadStateMachine > incorrectly > --- > > Key: RATIS-2148 > URL: https://issues.apache.org/jira/browse/RATIS-2148 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0 >Reporter: yuuka >Assignee: yuuka >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-09-03-14-24-25-652.png, > image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, > image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, > image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Due to the fact that grpc streaming snapshot sending sends all requests at > once, error handling is performed after all are sent, and the last snapshot > request is used as a completion flag, which may lead to the successful > receipt of the last request, but the previous request has failed. The sender > handles the failure event during the retransmission of the snapshot. The > receiver triggers state.reloadStateMachine because it successfully receives > the last request, but due to incomplete snapshot reception > > An md5 mismatch exception occurred before the last SnapshotRequest was > received > !image-2024-09-03-14-27-39-406.png! > > The last snapshot request arrived, then successfully received, and then > updated the index. > !image-2024-09-03-14-28-31-529.png! > !image-2024-09-03-14-30-02-751.png! > > However, the snapshot reception is incomplete and triggers the > reloadStateMachine. > !image-2024-09-03-14-33-49-573.png! > > I suggest using a flag to identify whether the entire snapshot request is > abnormal. 
> If an exception occurs, the subsequent content of the request will not be > processed. > Alternatively, the sender could wait for the receiver's reply and, if there > is an error, resend. > > Finally, the current retry granularity is the entire snapshot directory > rather than a single chunk, which causes a large number of snapshot files > to be sent repeatedly; this can be optimized later -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (RATIS-2151) TestRaftWithGrpc may fail after RATIS-2129
Tsz-wo Sze created RATIS-2151: - Summary: TestRaftWithGrpc may fail after RATIS-2129 Key: RATIS-2151 URL: https://issues.apache.org/jira/browse/RATIS-2151 Project: Ratis Issue Type: Task Reporter: Tsz-wo Sze Assignee: Tsz-wo Sze - After RATIS-2129: TestRaftWithGrpc#[781d61d37411b374f104eb0806e1e2c4090fb35e]-10x10: 91/100 failures https://github.com/szetszwo/ratis/actions/runs/10747241634/job/29810232738 - Before RATIS-2129: TestRaftWithGrpc#[dfed1012983d1d7b5fb2c408e19b8661cbe000b4]-10x10 success https://github.com/szetszwo/ratis/actions/runs/10746526581 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879663#comment-17879663 ] Tsz-wo Sze commented on RATIS-2137: --- [~lemony], are you going to build it yourself? Or do you need a release based on 2.5.1? Please feel free to share your thought. > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Assignee: Kevin Liu >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-08-13-11-28-16-250.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > *I found that after the following log, the follower became unavailable. The > follower received incorrect entries repeatedly for about 10min, then got > installSnapshot failed and started to election. After two hours, it succeed > to install snapshot, but failed to updateLastAppliedTermIndex. After that, it > repeated 'receive installSnapshot and installSnapshot failed' for several > hours until I restarted the server.* > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > *(repeat 'Failed appendEntries')* > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0 > 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An > exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. 
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to > CANDIDATE at term 59 for changeToCandidate > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] > RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default) > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > start 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at > term 59 for PRE_VOTE > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit > vote requests at term 59 for 34233595: > peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 3|rpc:xxx:9862|admin:|c
[jira] [Updated] (RATIS-2150) No need for manual assembly:single execution when running mvn deploy
[ https://issues.apache.org/jira/browse/RATIS-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2150: -- Component/s: build > No need for manual assembly:single execution when running mvn deploy > > > Key: RATIS-2150 > URL: https://issues.apache.org/jira/browse/RATIS-2150 > Project: Ratis > Issue Type: Improvement > Components: build >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > [RATIS-2117|https://issues.apache.org/jira/browse/RATIS-2117] missed > the mvn deploy command update, which will be addressed in this issue -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (RATIS-2147) MD5 mismatch when accepting a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2147: - Assignee: yuuka > MD5 mismatch when accepting a snapshot > - > > Key: RATIS-2147 > URL: https://issues.apache.org/jira/browse/RATIS-2147 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0, 3.2.0 >Reporter: yuuka >Assignee: yuuka >Priority: Major > Attachments: image-2024-09-03-10-35-08-315.png, > image-2024-09-03-10-35-28-617.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > We encountered an MD5 mismatch issue in IoTDB, and after multiple > investigations, we found that the digester was contaminated > > We have checked that it is not a network or disk problem > > In the implementation, the received snapshot is written to a temporary file > first. If there is an MD5 mismatch, we read the data from this temporary > file and use a new digest to calculate the MD5, but the result of this > calculation is the same as the MD5 hash value sent > !image-2024-09-03-10-35-28-617.png! > > !image-2024-09-03-10-35-08-315.png! > > > Use the saved corrupted file name to locate the relevant log; here we take > the tlog.txt.snapshot corrupt file corrupt20240831-094107 _735 as an example > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA! > Before encountering the corruption, the sender sent several consecutive snapshot > installation requests to the receiver. > > The receiver successfully received some requests, then encountered a > corrupt request and began printing "recompute again" to start > recalculating. > > After execution, the ERROR log of the rename will be printed, and the data > will be read from the file and compared with the received chunk data. 
> > If a byte did not match, the corresponding information would have been printed; > however, no such log appeared, which means that the content written to > the disk is the same as the content sent > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA! > This makes the problem very clear. There is a problem with the MD5 > calculation class, and the reasons are as follows: > > If a byte in the middle of the data part were incorrect due to network > reasons, the calculated result and the hash sent would have to differ > > If there were a problem with the part that stores the hash value, the final > calculation result would also differ. > > I suggest creating a new digest every time the follower receives a snapshot, so as > to avoid contamination problems. Under normal network and disk conditions, > corruption will not occur -- This message was sent by Atlassian Jira (v8.20.10#820010)
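The fix suggested above (a fresh digest for every received snapshot) can be sketched as follows. This is a minimal illustration, not Ratis code; the class and method names are hypothetical:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class SnapshotDigestSketch {
    // Compute the MD5 of a chunk with a FRESH MessageDigest. Creating a new
    // digest per snapshot avoids carrying over state from a previous,
    // aborted transfer (the "contamination" described in the report).
    static byte[] freshMd5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is always available", e);
        }
    }

    // Demonstrate the failure mode: a shared digest that was updated earlier
    // and never reset yields a different hash for the very same chunk.
    static boolean isContaminated(byte[] chunk) {
        try {
            MessageDigest shared = MessageDigest.getInstance("MD5");
            shared.update("stale-bytes-from-aborted-transfer".getBytes());
            byte[] polluted = shared.digest(chunk); // stale bytes included
            return !Arrays.equals(polluted, freshMd5(chunk));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Since `MessageDigest` accumulates all bytes passed to `update(..)` until `digest()` completes, any leftover state from an earlier transfer changes the result; instantiating a new digest per snapshot sidesteps this entirely.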
[jira] [Assigned] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet
[ https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2149: - Component/s: election Assignee: yuuka Issue Type: Improvement (was: Wish) > Do not perform leader election if the current RaftServer has not started yet > > > Key: RATIS-2149 > URL: https://issues.apache.org/jira/browse/RATIS-2149 > Project: Ratis > Issue Type: Improvement > Components: election >Reporter: yuuka >Assignee: yuuka >Priority: Major > Attachments: image-2024-09-03-17-41-41-872.png, > image-2024-09-03-18-13-50-628.png > > Time Spent: 1h > Remaining Estimate: 0h > > Sometimes we cannot guarantee that the program will run normally in various > environments, and appropriate robustness enhancement may be necessary. > Before adding members, RaftServer S and the corresponding group will be > created if the group does not exist; we found that the interval between these > two logs is more than one minute. > !image-2024-09-03-17-41-41-872.png! > > Since our RpcTimeout is smaller than 1 minute, the retryPolicy has already > kicked in, but S's groupId is already in the implMaps of RaftServerProxy, which > throws AlreadyExistsException. When we catch this exception, we assume that the > creation has been completed and the member change can be executed. > > However, S is still in the initializing state, so this member change will not be > completed. Finally, we found that S started an election, received a > NOT_IN_CONF reply, and then closed itself > !image-2024-09-03-18-13-50-628.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
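The improvement requested above (skip the election while the server is still starting) could look roughly like this. This is a hypothetical sketch, not the actual Ratis lifecycle API:

```java
// A server that has not finished starting must not campaign, otherwise it
// may receive a NOT_IN_CONF reply and close itself as described above.
class ElectionGuardSketch {
    enum State { STARTING, RUNNING, CLOSED }

    private volatile State state = State.STARTING;

    void markRunning() { state = State.RUNNING; }

    // Only a fully started server may begin a (pre-)vote round.
    boolean mayStartElection() {
        return state == State.RUNNING;
    }
}
```

The key design choice is that the guard is checked at election time rather than relying on the caller having waited for startup, so a retry that races with initialization simply results in a skipped election instead of a self-close.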
[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879295#comment-17879295 ] Tsz-wo Sze commented on RATIS-2137: --- For 2.5.1, let's also cherry-pick RATIS-1902. I have just tried it; the code conflict is minor. After that, this (RATIS-2137) can be cherry-picked cleanly. > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Assignee: Kevin Liu >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-08-13-11-28-16-250.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > *I found that after the following log, the follower became unavailable. The > follower received incorrect entries repeatedly for about 10 minutes, then > installSnapshot failed and it started an election. After two hours, it > succeeded in installing the snapshot, but failed to updateLastAppliedTermIndex. > After that, it repeated 'receive installSnapshot and installSnapshot failed' > for several hours until I restarted the server.* > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > *(repeat 'Failed appendEntries')* > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0 > 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An > exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. 
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to > CANDIDATE at term 59 for changeToCandidate > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] > RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default) > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > start 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at > term 59 for PRE_VOTE > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit > vote requests at term 59 for 34233595: > peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, >
[jira] [Commented] (RATIS-2148) GRPC streaming snapshot transfer may cause followers to trigger reloadStateMachine incorrectly
[ https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878970#comment-17878970 ] Tsz-wo Sze commented on RATIS-2148: --- That's great, thanks! > GRPC streaming snapshot transfer may cause followers to trigger > reloadStateMachine incorrectly > -- > > Key: RATIS-2148 > URL: https://issues.apache.org/jira/browse/RATIS-2148 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0, 3.2.0 >Reporter: yuuka >Priority: Major > Attachments: image-2024-09-03-14-24-25-652.png, > image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, > image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, > image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png > > > Because gRPC streaming snapshot transfer sends all requests at once, error > handling is performed only after all are sent, and the last snapshot request > is used as a completion flag. This may lead to the last request being > received successfully even though a previous request has failed. The sender > handles the failure event by retransmitting the snapshot. The > receiver triggers state.reloadStateMachine because it successfully receives > the last request, even though the snapshot reception is incomplete > > An md5 mismatch exception occurred before the last SnapshotRequest was > received > !image-2024-09-03-14-27-39-406.png! > > The last snapshot request arrived, was successfully received, and the index > was then updated. > !image-2024-09-03-14-28-31-529.png! > !image-2024-09-03-14-30-02-751.png! > > However, the snapshot reception is incomplete and triggers the > reloadStateMachine. > !image-2024-09-03-14-33-49-573.png! > > I suggest using a flag to identify whether the entire snapshot transfer is > abnormal. > If an exception occurs, the subsequent content of the request will not be > processed. > Or the sender will wait for the receiver's reply. 
If there is a release > error, resend it. > > Finally, the current error retry granularity is the entire snapshot directory > rather than a single chunk, which causes a large number of snapshot files > to be sent repeatedly; this can be optimized later -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (RATIS-2148) GRPC streaming snapshot transfer may cause followers to trigger reloadStateMachine incorrectly
[ https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878962#comment-17878962 ] Tsz-wo Sze commented on RATIS-2148: --- https://github.com/apache/ratis/blob/8f5159db4cade67b96c7b9c8589e7c0cdba571e0/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotInstallationHandler.java#L187-L191 {code} //SnapshotInstallationHandler.java // update the committed index // re-load the state machine if this is the last chunk if (snapshotChunkRequest.getDone()) { state.reloadStateMachine(lastIncluded); } {code} You are right. The SnapshotInstallationHandler code above should check if the request is completed successfully before calling reloadStateMachine(..). Would you like to provide a pull request? > GRPC streaming snapshot transfer may cause followers to trigger > reloadStateMachine incorrectly > -- > > Key: RATIS-2148 > URL: https://issues.apache.org/jira/browse/RATIS-2148 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0, 3.2.0 >Reporter: yuuka >Priority: Major > Attachments: image-2024-09-03-14-24-25-652.png, > image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, > image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, > image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png > > > Because gRPC streaming snapshot transfer sends all requests at once, error > handling is performed only after all are sent, and the last snapshot request > is used as a completion flag. This may lead to the last request being > received successfully even though a previous request has failed. The sender > handles the failure event by retransmitting the snapshot. 
The > receiver triggers state.reloadStateMachine because it successfully receives > the last request, even though the snapshot reception is incomplete > > An md5 mismatch exception occurred before the last SnapshotRequest was > received > !image-2024-09-03-14-27-39-406.png! > > The last snapshot request arrived, was successfully received, and the index > was then updated. > !image-2024-09-03-14-28-31-529.png! > !image-2024-09-03-14-30-02-751.png! > > However, the snapshot reception is incomplete and triggers the > reloadStateMachine. > !image-2024-09-03-14-33-49-573.png! > > I suggest using a flag to identify whether the entire snapshot transfer is > abnormal. > If an exception occurs, the subsequent content of the request will not be > processed. > Or the sender will wait for the receiver's reply. If there is a release > error, resend it. > > Finally, the current error retry granularity is the entire snapshot directory > rather than a single chunk, which causes a large number of snapshot files > to be sent repeatedly; this can be optimized later -- This message was sent by Atlassian Jira (v8.20.10#820010)
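The check agreed on in the comment above (only reload the state machine when the final chunk arrives AND no earlier chunk failed) can be sketched as follows. Names here are hypothetical, not the actual SnapshotInstallationHandler fields:

```java
// Track whether any chunk of the streamed snapshot failed, and gate
// reloadStateMachine(..) on both the "done" flag and the failure flag.
class SnapshotReceiverSketch {
    private boolean anyChunkFailed = false;

    void onChunkError() { anyChunkFailed = true; }

    // The "done" flag of the last chunk alone is not sufficient: an earlier
    // chunk may have failed even though the last one arrived successfully.
    boolean shouldReloadStateMachine(boolean done) {
        return done && !anyChunkFailed;
    }
}
```

With this guard, a failed chunk followed by a successful final chunk no longer triggers a reload over an incomplete snapshot; the sender's retransmission gets a clean retry instead.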
[jira] [Commented] (RATIS-2147) MD5 mismatch when accepting a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878701#comment-17878701 ] Tsz-wo Sze commented on RATIS-2147: --- [~tohsakarin__], thanks for reporting the problem! Would you like to submit a pull request? > MD5 mismatch when accepting a snapshot > -- > > Key: RATIS-2147 > URL: https://issues.apache.org/jira/browse/RATIS-2147 > Project: Ratis > Issue Type: Bug > Components: snapshot >Affects Versions: 3.1.0, 3.2.0 >Reporter: yuuka >Priority: Major > Attachments: image-2024-09-03-10-35-08-315.png, > image-2024-09-03-10-35-28-617.png > > > We encountered an MD5 mismatch issue in IoTDB, and after multiple > investigations, we found that the digester was contaminated > > We have checked that it is not a network or disk problem > > In the implementation, the received snapshot is written to a temporary file > first. If there is an MD5 mismatch, we read the data from this temporary > file and use a new digest to calculate the MD5, but the result of this > calculation is the same as the MD5 hash value sent > !image-2024-09-03-10-35-28-617.png! > > !image-2024-09-03-10-35-08-315.png! > > > Use the saved corrupted file name to locate the relevant log; here we take > the tlog.txt.snapshot corrupt file corrupt20240831-094107 _735 as an example > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YjM4MWY1MTA2Y2EyYWU4MmZlNDE0Mzg3MDRjYTBjMjRfU0dPbEpVbWFNalV1V1lSUVllOGFISUdWbUhqanRFdFdfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA! > Before encountering the corruption, the sender sent several consecutive snapshot > installation requests to the receiver. > > The receiver successfully received some requests, then encountered a > corrupt request and began printing "recompute again" to start > recalculating. 
> > After execution, the ERROR log of the rename will be printed, and the data > will be read from the file and compared with the received chunk data. > > If a byte did not match, the corresponding information would have been printed; > however, no such log appeared, which means that the content written to > the disk is the same as the content sent > !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YmZlYjk1YjAwOWE4MDJlYTEzZjkxMjljODU1MzQxMTZfMkU0NmlPRWpidDBweGNzWXY4cHNJZG14b1o3Z1BZMzhfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA! > This makes the problem very clear. There is a problem with the MD5 > calculation class, and the reasons are as follows: > > If a byte in the middle of the data part were incorrect due to network > reasons, the calculated result and the hash sent would have to differ > > If there were a problem with the part that stores the hash value, the final > calculation result would also differ. > > I suggest creating a new digest every time the follower receives a snapshot, so as > to avoid contamination problems. Under normal network and disk conditions, > corruption will not occur -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes
[ https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2146. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~tanxinyu]! > Fixed possible issues caused by concurrent deletion and election when member > changes > > > Key: RATIS-2146 > URL: https://issues.apache.org/jira/browse/RATIS-2146 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-08-28-14-53-23-259.png, > image-2024-08-28-14-53-27-637.png > > Time Spent: 1h > Remaining Estimate: 0h > > During this process, we encountered some concurrency issues: > * After the member change is complete, node D will no longer be a member of > this consensus group. It will attempt to initiate an election but receive a > NOT_IN_CONF response, after which it will close itself. > * During the removal of member D, it will also close itself first, and then > proceed to delete the file directory. > These two CLOSE operations may occur concurrently, which could result in the > directory being deleted while the StateMachineUpdater thread has not yet > closed, ultimately leading to unexpected errors. > !image-2024-08-28-14-53-23-259.png! > !image-2024-08-28-14-53-27-637.png! > I believe there are two possible solutions for this issue: > * Add concurrency control to the close function, such as adding the > synchronized keyword to the function. > * Add some checks before deleting the directory to ensure that the callback > functions in the close process have already been executed before the > directory is deleted. > What's your opinion? [~szetszwo] -- This message was sent by Atlassian Jira (v8.20.10#820010)
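The two options proposed in the issue above (a synchronized close, and a check that close has finished before the directory is deleted) can be sketched together. Names are hypothetical, not the Ratis API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class DivisionLifecycleSketch {
    private final CountDownLatch closed = new CountDownLatch(1);

    // Option 1: synchronized, so the two concurrent CLOSE paths (election
    // failure with NOT_IN_CONF, and member removal) serialize.
    synchronized void close() {
        // ... shut down StateMachineUpdater and run close callbacks here ...
        closed.countDown(); // signal that close has fully completed
    }

    // Option 2: the deleter waits for close() to finish before removing the
    // raft storage directory; returns true when deletion is safe.
    boolean awaitClosed(long timeoutMs) {
        try {
            return closed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The latch makes the "callbacks already executed" condition from the second option explicit: deletion only proceeds after `close()` has counted down, so the StateMachineUpdater can no longer observe a vanished directory.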
[jira] [Updated] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes
[ https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2146: -- Component/s: server > Fixed possible issues caused by concurrent deletion and election when member > changes > > > Key: RATIS-2146 > URL: https://issues.apache.org/jira/browse/RATIS-2146 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Xinyu Tan >Assignee: Xinyu Tan >Priority: Major > Attachments: image-2024-08-28-14-53-23-259.png, > image-2024-08-28-14-53-27-637.png > > Time Spent: 1h > Remaining Estimate: 0h > > During this process, we encountered some concurrency issues: > * After the member change is complete, node D will no longer be a member of > this consensus group. It will attempt to initiate an election but receive a > NOT_IN_CONF response, after which it will close itself. > * During the removal of member D, it will also close itself first, and then > proceed to delete the file directory. > These two CLOSE operations may occur concurrently, which could result in the > directory being deleted while the StateMachineUpdater thread has not yet > closed, ultimately leading to unexpected errors. > !image-2024-08-28-14-53-23-259.png! > !image-2024-08-28-14-53-27-637.png! > I believe there are two possible solutions for this issue: > * Add concurrency control to the close function, such as adding the > synchronized keyword to the function. > * Add some checks before deleting the directory to ensure that the callback > functions in the close process have already been executed before the > directory is deleted. > What's your opinion? [~szetszwo] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2129) Low replication performance because of lock contention on RaftLog
[ https://issues.apache.org/jira/browse/RATIS-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2129. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request [#1141|https://github.com/apache/ratis/pull/1141] is now merged to the master branch. > Low replication performance because of lock contention on RaftLog > - > > Key: RATIS-2129 > URL: https://issues.apache.org/jira/browse/RATIS-2129 > Project: Ratis > Issue Type: Improvement > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Assignee: Tsz-wo Sze >Priority: Blocker > Labels: Performance, performance > Fix For: 3.2.0 > > Attachments: Screenshot 2024-07-22 at 4.40.07 PM-1.png, Screenshot > 2024-07-22 at 4.40.07 PM.png, dn_echo_leader_profile.html, > image-2024-07-22-15-25-46-155.png, ratis_ratfLog_lock_contention.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Today, the GrpcLogAppender thread makes a lot of calls that need RaftLog's > readLock. In an active environment, RaftLog is always busy appending > transactions from clients, thus the writeLock is frequently busy. This makes the > replication performance slow. > See the [^dn_echo_leader_profile.html], or in the picture below: the purple > is the time taken to acquire the readLock from RaftLog. > # !image-2024-07-22-15-25-46-155.png|width=854,height=425! > h2. A summary of LockContention in Ratis > !ratis_ratfLog_lock_contention.png|width=392,height=380! > Today, RaftLog consistency is protected by a global ReadWriteLock. (global > means RaftLog has a single ReadWriteLock and the lock is acquired at the > scope of the RaftLog instance, or a RaftGroup). > In a RaftGroup, the following actors race to obtain this global ReadWriteLock > in the leader node: > * The writer, which is the GRPC Client Service, accepts transaction > submissions from Raft clients and appends transactions (or log entries) to > RaftLog. 
Each append operation needs to acquire the writeLock from RaftLog to > put the transaction into RaftLog's memory queue. Although each of these append > operations is quick, Ratis is designed to maximize transaction appends, so > the writeLock is almost always busy. > * StateMachineUpdater. For each transaction, when it is acknowledged by > enough followers, this single-thread actor will read the log from RaftLog and > call StateMachine to apply the transaction. This actor acquires the readLock from > RaftLog for each log entry read. > * GrpcLogAppender: for each follower, there's a thread of GrpcLogAppender > that constantly reads log entries from RaftLog and replicates them to the > follower. This thread acquires the readLock from RaftLog every time it reads a > log entry. > The writer, StateMachineUpdater, and GrpcLogAppender are all designed in a > way to maximize their throughput. For instance, StateMachineUpdater invokes > StateMachine's applyTransaction as asynchronous calls. The same applies to the way > GrpcLogAppender replicates log entries to the follower. > The global ReadWriteLock *creates a tough contention* between the RaftLog > writers and readers. And that is what limits the Ratis throughput. The > faster the writers and readers are, the more they block each other. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
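The contention pattern described above can be illustrated with a toy log guarded by one global `ReentrantReadWriteLock` (an illustration only, not the Ratis source):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A single global ReadWriteLock guards the whole log, so the appender
// (write lock) and the per-entry readers, in Ratis the GrpcLogAppender and
// StateMachineUpdater threads, contend on every single operation.
class GlobalLockLogSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> entries = new ArrayList<>();

    void append(String entry) {          // writer path: client service
        lock.writeLock().lock();
        try {
            entries.add(entry);
        } finally {
            lock.writeLock().unlock();
        }
    }

    String read(int index) {             // reader path: one lock per entry
        lock.readLock().lock();
        try {
            return entries.get(index);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the write lock excludes all readers, every append stalls every in-flight read and vice versa; the faster each side runs, the more lock acquisitions per second and thus the more blocking, which is exactly the effect the profile in the issue shows.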
[jira] [Resolved] (RATIS-2145) Follower hangs until the next trigger to take a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2145. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~z-bb]! > Follower hangs until the next trigger to take a snapshot > - > > Key: RATIS-2145 > URL: https://issues.apache.org/jira/browse/RATIS-2145 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major > Fix For: 3.2.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We discovered a problem when writing tests with high concurrency. It often > happens that a follower is running well and then triggers takeSnapshot. > The following is the relevant log. > follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and > 2024/08/22 20:21:57,058, no other logs appeared on the follower, but a > follower election was not triggered, indicating that the heartbeats the > leader sent to the follower were successful) > {code:java} > 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO > org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly > 22436696498 -> 22441096501 > 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: > node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment > /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615 > 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax > old=22432683959, new=22437078979, updated? 
true > 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO > com.xxx.RaftJournalManager: Received install snapshot notification from > MetaStore leader: node3 with term index: (t:192, i:22441477801) > 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO > com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from > Leader MetaStore node3. Checkpoint address: leader:8170 > 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, > i:22441477801) > 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastReply: null > 2024/08/22 20:21:57,067 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed > appendEntries as snapshot (22441477801) installation is in progress > 2024/08/22 20:21:57,068 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: > inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code} > leader: > {code:java} > 2024/08/22 20:18:16,958 [timer5] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, > i:22441098598)...(t:192, i:22441098622) > 2024/08/22 20:18:16,964 [timer3] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, > i:22441098624) > 2024/08/22 20:18:16,964 [timer6] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, > i:22441098625) > 2024/08/22 20:18:16,964 [timer7] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, > i:22441098623) > 2024/08/22 20:18:16,965 [timer3] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192, > i:22441098627) > 2024/08/22 20:18:16,965 [timer7] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appe
[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877751#comment-17877751 ] Tsz-wo Sze commented on RATIS-2145: --- bq. What's the problem if it's not the lowest index? Can't we ensure this through the follower's checkInconsistentAppendEntries? You are right that checkInconsistentAppendEntries ensures follower log entries are in the correct order. It seems also okay if it is not the lowest index. The leader will just miss the reply, so it won't update indices. As long as the leader gets a SUCCESS reply, it knows that all the previous entries are correct. That's a good idea! Could you submit a pull request? > Follower hangs until the next trigger to take a snapshot > - > > Key: RATIS-2145 > URL: https://issues.apache.org/jira/browse/RATIS-2145 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We discovered a problem when writing tests with high concurrency. It often > happens that a follower is running well and then triggers takeSnapshot. > The following is the relevant log. 
> follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and > 2024/08/22 20:21:57,058 no other logs appeared on the follower, but a > follower election was not triggered, indicating that the heartbeats the > leader sent to the follower were successful) > {code:java} > 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO > org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly > 22436696498 -> 22441096501 > 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: > node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment > /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615 > 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] > INFO org.apache.ratis.server.raftlog.RaftLog: > node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax > old=22432683959, new=22437078979, updated? true > 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO > com.xxx.RaftJournalManager: Received install snapshot notification from > MetaStore leader: node3 with term index: (t:192, i:22441477801) > 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO > com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from > Leader MetaStore node3. 
Checkpoint address: leader:8170 > 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, > i:22441477801) > 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed > INSTALL_SNAPSHOT, lastReply: null > 2024/08/22 20:21:57,067 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed > appendEntries as snapshot (22441477801) installation is in progress > 2024/08/22 20:21:57,068 [node1-server-thread55] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: > inconsistency entries. > Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code} > leader: > {code:java} > 2024/08/22 20:18:16,958 [timer5] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, > i:22441098598)...(t:192, i:22441098622) > 2024/08/22 20:18:16,964 [timer3] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, > i:22441098624) > 2024/08/22 20:18:16,964 [timer6] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, > i:22441098625) > 2024/08/22 20:18:16,964 [timer7] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, > errorCount=1, > 
request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, > i:22441098623) > 2
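The invariant discussed in the comment above — the leader only advances its indices on SUCCESS replies, and a SUCCESS reply confirms all earlier entries, so a timed-out request can simply be dropped from pendingRequests — can be sketched as follows. This is an illustrative model only, not the actual Ratis GrpcLogAppender code; the class and method names (PendingRequestsSketch, onTimeout, onSuccess) are hypothetical.

```java
import java.util.TreeMap;

public class PendingRequestsSketch {
  // pending AppendEntries requests: first log index -> last log index
  private final TreeMap<Long, Long> pending = new TreeMap<>();
  private long matchIndex = -1;

  public void add(long firstIndex, long lastIndex) {
    pending.put(firstIndex, lastIndex);
  }

  // Timed out: just drop the request. The leader misses that one reply,
  // so it simply does not update its indices from it.
  public void onTimeout(long firstIndex) {
    pending.remove(firstIndex);
  }

  // A SUCCESS reply confirms all entries up to the request's last index,
  // including earlier entries whose replies were missed.
  public void onSuccess(long firstIndex) {
    final Long lastIndex = pending.remove(firstIndex);
    if (lastIndex != null) {
      matchIndex = Math.max(matchIndex, lastIndex);
    }
  }

  public long getMatchIndex() {
    return matchIndex;
  }

  public static void main(String[] args) {
    final PendingRequestsSketch p = new PendingRequestsSketch();
    p.add(100, 124);   // e.g. a batch of entries with indices 100..124
    p.add(125, 125);
    p.onTimeout(100);  // dropped: no reply, indices unchanged
    if (p.getMatchIndex() != -1) throw new AssertionError();
    p.onSuccess(125);  // SUCCESS confirms everything up to 125
    if (p.getMatchIndex() != 125) throw new AssertionError();
  }
}
```

The point of the sketch is that dropping a non-lowest timed-out request never produces a wrong matchIndex; it only delays the update until a later SUCCESS arrives.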
[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877570#comment-17877570 ] Tsz-wo Sze commented on RATIS-2145: --- Hi [~z-bb], thanks a lot for filing this bug! Some questions/comments: bq. Because the leader did not receive the onNext callback within the requestTimeoutDuration(3s) time ... Why? GC? Network issues? It may need a longer timeout. Apache Ozone uses a 30s timeout by default. bq. ... Is it possible to replace handleTimeout with the remove method here, ... I guess you mean to remove the request from pendingRequests when it times out? It probably is okay if the request is the one with the lowest index in pendingRequests. > Follower hangs until the next trigger to take a snapshot > - > > Key: RATIS-2145 > URL: https://issues.apache.org/jira/browse/RATIS-2145 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major
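The 3s requestTimeoutDuration mentioned above is the default append-entries request timeout; raising it (Apache Ozone uses 30s) is a one-line configuration change. A minimal sketch using plain java.util.Properties — the key name raft.server.rpc.request.timeout is an assumption here and should be checked against the Ratis version in use:

```java
import java.util.Properties;

public class TimeoutConfigSketch {
  // Sketch only: the property key below is an assumption based on the
  // raft.server.rpc prefix; verify it against the Ratis version in use.
  public static Properties withLongerRequestTimeout() {
    Properties p = new Properties();
    // default is 3s; Apache Ozone raises this to 30s
    p.setProperty("raft.server.rpc.request.timeout", "30s");
    return p;
  }

  public static void main(String[] args) {
    if (!"30s".equals(withLongerRequestTimeout()
        .getProperty("raft.server.rpc.request.timeout"))) {
      throw new AssertionError("unexpected timeout value");
    }
  }
}
```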
[jira] [Commented] (HDDS-11382) Remove the use of caniuse-lite
[ https://issues.apache.org/jira/browse/HDDS-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877483#comment-17877483 ] Tsz-wo Sze commented on HDDS-11382: --- Note that caniuse-api (MIT) is okay to use but caniuse-lite (CC 4.0) is not. - caniuse-lite (CC 4.0); see LEGAL-678 -* https://github.com/browserslist/caniuse-lite/blob/main/LICENSE - caniuse-api (MIT) -* https://github.com/Nyalab/caniuse-api/blob/master/LICENSE > Remove the use of caniuse-lite > -- > > Key: HDDS-11382 > URL: https://issues.apache.org/jira/browse/HDDS-11382 > Project: Apache Ozone > Issue Type: Task > Components: Ozone Recon >Reporter: Tsz-wo Sze >Priority: Blocker > > After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite > - > ./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt > {code} > 4491: The following software may be included in this product: caniuse-lite. A > copy of the source code may be downloaded from > https://github.com/ben-eb/caniuse-lite.git. This software contains the > following license and notice below: >1 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE > {code} > 4491: The following software may be included in this product: caniuse-lite. A > copy of the source code may be downloaded from > https://github.com/ben-eb/caniuse-lite.git. 
This software contains the > following license and notice below: >1 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json > {code} > 2043: "caniuse-lite": "^1.0.30001208", > 2140: "caniuse-lite": { > 2142: "resolved": > "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, >3 occurrence(s) > {code} > - > ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encodin
[jira] [Updated] (HDDS-11382) Remove the use of caniuse-lite
[ https://issues.apache.org/jira/browse/HDDS-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-11382: -- Description: After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite - ./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 
occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/target/classes/webapps/recon/ozone-recon-web/LICENSE -- {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) {code} Totally 30 occurrence(s) in 305545 file(s). was: After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite - ./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone
[jira] [Commented] (HDDS-11368) Remove babel dependencies from Recon
[ https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877481#comment-17877481 ] Tsz-wo Sze commented on HDDS-11368: --- [~abhishek.pal], thanks a lot for fixing this! After this, we still have caniuse-lite; filed HDDS-11382. > Remove babel dependencies from Recon > > > Key: HDDS-11368 > URL: https://issues.apache.org/jira/browse/HDDS-11368 > Project: Apache Ozone > Issue Type: Task > Components: Ozone Recon >Affects Versions: 1.4.0 >Reporter: Abhishek Pal >Assignee: Abhishek Pal >Priority: Blocker > Labels: pull-request-available > Fix For: 1.5.0 > > > *caniuse-lite* is currently being imported as a part of babel, which is > internally used by vitejs/plugin-react. > Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be > used in our projects. > This JIRA is to track the removal of the dependency from Recon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Created] (HDDS-11382) Remove the use of caniuse-lite
Tsz-wo Sze created HDDS-11382: - Summary: Remove the use of caniuse-lite Key: HDDS-11382 URL: https://issues.apache.org/jira/browse/HDDS-11382 Project: Apache Ozone Issue Type: Task Components: Ozone Recon Reporter: Tsz-wo Sze After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite - ./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json {code} 2043: "caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/ts-use/package-lock.json {code} 2043: 
"caniuse-lite": "^1.0.30001208", 2140: "caniuse-lite": { 2142: "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";, 3 occurrence(s) {code} - ./hadoop-ozone/recon/target/classes/webapps/recon/ozone-recon-web/LICENSE -- {code} 4491: The following software may be included in this product: caniuse-lite. A copy of the source code may be downloaded from https://github.com/ben-eb/caniuse-lite.git. This software contains the following license and notice below: 1 occurrence(s) {code} Totally 30 occurrence(s) in 305545 file(s). -- This messag
[jira] [Assigned] (HDDS-11375) DN Startup fails with Illegal configuration
[ https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned HDDS-11375: - Assignee: Wei-Chiu Chuang (was: Tsz-wo Sze) [~weichiu], thanks a lot for fixing this! > DN Startup fails with Illegal configuration > --- > > Key: HDDS-11375 > URL: https://issues.apache.org/jira/browse/HDDS-11375 > Project: Apache Ozone > Issue Type: Bug > Components: Ozone Datanode >Reporter: Pratyush Bhatt >Assignee: Wei-Chiu Chuang >Priority: Major > Labels: pull-request-available > Fix For: 1.5.0 > > > This is a problem if Ozone is upgraded to the latest (unreleased) Ratis code > base. Ozone currently is using Ratis 3.1.0 which does not have this problem. > All Ozone DN startup is failing with below error: > {code:java} > 2024-08-27 15:54:46,040 ERROR > [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in > HddsDatanodeService. > java.lang.RuntimeException: Can't start the HDDS datanode plugin > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95) > at picocli.CommandLine.executeUserObject(CommandLine.java:2045) > at picocli.CommandLine.access$1500(CommandLine.java:148) > at > picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2457) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2419) > at > picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277) > at picocli.CommandLine$RunLast.execute(CommandLine.java:2421) > at picocli.CommandLine.execute(CommandLine.java:2174) > at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100) > at 
org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91) > at > org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159) > Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal > configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m > (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432). > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56) > at > org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196) > at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.(XceiverServerRatis.java:214) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:209) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:183) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291) > ... 14 more > Caused by: java.lang.IllegalArgumentException: Illegal configuration: > raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger > than raft.server.log.appender.buffer.byte-limit(= 33554432). 
> at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:184) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:152) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:57) > at > org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111) > at > org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133) > at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40) > at > org.apache.ratis.server.impl.RaftServerProxy.(RaftServerProxy.java:212) > at > org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74) > at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Nati
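The IllegalArgumentException above encodes a simple arithmetic constraint: raft.grpc.message.size.max must be at least 1 MB (1048576 bytes) larger than raft.server.log.appender.buffer.byte-limit, so with both values at 32 MB the check fails, and raising message.size.max to 33 MB (or lowering the buffer limit) resolves it. A small sketch of the check below uses an illustrative helper name, not the actual GrpcService code:

```java
public class GrpcSizeCheckSketch {
  static final long ONE_MB = 1L << 20;  // 1048576 bytes

  // raft.grpc.message.size.max must be at least 1 MB larger than
  // raft.server.log.appender.buffer.byte-limit (per the error message above).
  public static boolean isValid(long messageSizeMax, long appenderBufferByteLimit) {
    return messageSizeMax >= appenderBufferByteLimit + ONE_MB;
  }

  public static void main(String[] args) {
    final long buffer = 32 * ONE_MB;              // 33554432, as in the log
    // both at 32 MB fails, exactly as the DN startup error reports
    if (isValid(32 * ONE_MB, buffer)) throw new AssertionError();
    // raising message.size.max to 33 MB satisfies the constraint
    if (!isValid(33 * ONE_MB, buffer)) throw new AssertionError();
  }
}
```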
[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot
[ https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877227#comment-17877227 ] Tsz-wo Sze commented on RATIS-2145: --- Sure, will review this tomorrow. > Follower hangs until the next trigger to take a snapshot > - > > Key: RATIS-2145 > URL: https://issues.apache.org/jira/browse/RATIS-2145 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major
[jira] [Assigned] (HDDS-11375) DN Startup fails with Illegal configuration
[ https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned HDDS-11375: - Assignee: Tsz-wo Sze Description: This is a problem if Ozone is upgraded to the latest (unreleased) Ratis code base. Ozone currently is using Ratis 3.1.0 which does not have this problem. All Ozone DN startup is failing with below error: {code:java} 2024-08-27 15:54:46,040 ERROR [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in HddsDatanodeService. java.lang.RuntimeException: Can't start the HDDS datanode plugin at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336) at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209) at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177) at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95) at picocli.CommandLine.executeUserObject(CommandLine.java:2045) at picocli.CommandLine.access$1500(CommandLine.java:148) at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465) at picocli.CommandLine$RunLast.handle(CommandLine.java:2457) at picocli.CommandLine$RunLast.handle(CommandLine.java:2419) at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277) at picocli.CommandLine$RunLast.execute(CommandLine.java:2421) at picocli.CommandLine.execute(CommandLine.java:2174) at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100) at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91) at org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159) Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432). 
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56) at org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196) at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210) at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.(XceiverServerRatis.java:214) at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533) at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:209) at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:183) at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291) ... 14 more Caused by: java.lang.IllegalArgumentException: Illegal configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432). at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:184) at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:152) at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:57) at org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111) at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133) at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40) at org.apache.ratis.server.impl.RaftServerProxy.(RaftServerProxy.java:212) at org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74) at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212) at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225) at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212) at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204) at org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73) at 
org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:191) ... 20 more 2024-08-27 15:54:46,045 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.HddsDatanodeService: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down HddsDatanodeService at host/xx.xx.xx.xx{code} was: All Ozone DN startup is failing with below error: {code:java} 2024-08-27 15:54:46,040 ERROR [main]-org.apache.hadoop.oz
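The constraint behind this startup failure can be sketched as follows. This is an illustration of the check described in the error message, not the actual Ratis GrpcService code; the class and method names here are made up. The idea is that raft.grpc.message.size.max must leave at least 1 MB (1048576 bytes) of headroom over raft.server.log.appender.buffer.byte-limit, presumably so a full appender buffer plus message overhead still fits in one gRPC message:

```java
public class RatisSizeCheckSketch {
  static final long ONE_MB = 1L << 20; // 1048576 bytes

  // Hypothetical version of the validation that throws in the stack trace.
  static void check(long grpcMessageSizeMax, long appenderBufferByteLimit) {
    if (grpcMessageSizeMax < appenderBufferByteLimit + ONE_MB) {
      throw new IllegalArgumentException(
          "Illegal configuration: raft.grpc.message.size.max (= "
              + grpcMessageSizeMax + ") must be 1m (=" + ONE_MB
              + ") larger than raft.server.log.appender.buffer.byte-limit (= "
              + appenderBufferByteLimit + ")");
    }
  }

  public static void main(String[] args) {
    long thirtyTwoMb = 32 * ONE_MB;
    try {
      // Both values at 32 MB reproduces the DN startup failure above.
      check(thirtyTwoMb, thirtyTwoMb);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected");
    }
    // Raising the gRPC limit by 1 MB satisfies the constraint.
    check(thirtyTwoMb + ONE_MB, thirtyTwoMb);
    System.out.println("accepted");
  }
}
```

So either raising raft.grpc.message.size.max or lowering raft.server.log.appender.buffer.byte-limit by at least 1 MB would satisfy the check.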
[jira] [Commented] (HDDS-11375) DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"
[ https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877112#comment-17877112 ] Tsz-wo Sze commented on HDDS-11375: --- [~pratyush.bhatt], could you provide the commit id of your code base? > DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin" > -- > > Key: HDDS-11375 > URL: https://issues.apache.org/jira/browse/HDDS-11375 > Project: Apache Ozone > Issue Type: Bug > Components: Ozone Datanode >Reporter: Pratyush Bhatt >Priority: Major > > All Ozone DN startup is failing with below error: > {code:java} > 2024-08-27 15:54:46,040 ERROR > [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in > HddsDatanodeService. > java.lang.RuntimeException: Can't start the HDDS datanode plugin > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95) > at picocli.CommandLine.executeUserObject(CommandLine.java:2045) > at picocli.CommandLine.access$1500(CommandLine.java:148) > at > picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2457) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2419) > at > picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277) > at picocli.CommandLine$RunLast.execute(CommandLine.java:2421) > at picocli.CommandLine.execute(CommandLine.java:2174) > at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100) > at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91) > at > org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159) > Caused by: java.io.IOException: 
java.lang.IllegalArgumentException: Illegal > configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m > (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432). > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56) > at > org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196) > at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.(XceiverServerRatis.java:214) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:209) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:183) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291) > ... 14 more > Caused by: java.lang.IllegalArgumentException: Illegal configuration: > raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger > than raft.server.log.appender.buffer.byte-limit(= 33554432). 
> at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:184) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:152) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:57) > at > org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111) > at > org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133) > at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40) > at > org.apache.ratis.server.impl.RaftServerProxy.(RaftServerProxy.java:212) > at > org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74) > at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.refle
[jira] [Comment Edited] (HDDS-11375) DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"
[ https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877112#comment-17877112 ] Tsz-wo Sze edited comment on HDDS-11375 at 8/27/24 5:27 PM: [~pratyush.bhatt], could you provide the commit id of your code base? I will take a look. was (Author: szetszwo): [~pratyush.bhatt], could provide the commit id of your code base? > DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin" > -- > > Key: HDDS-11375 > URL: https://issues.apache.org/jira/browse/HDDS-11375 > Project: Apache Ozone > Issue Type: Bug > Components: Ozone Datanode >Reporter: Pratyush Bhatt >Priority: Major > > All Ozone DN startup is failing with below error: > {code:java} > 2024-08-27 15:54:46,040 ERROR > [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in > HddsDatanodeService. > java.lang.RuntimeException: Can't start the HDDS datanode plugin > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177) > at > org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95) > at picocli.CommandLine.executeUserObject(CommandLine.java:2045) > at picocli.CommandLine.access$1500(CommandLine.java:148) > at > picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2457) > at picocli.CommandLine$RunLast.handle(CommandLine.java:2419) > at > picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277) > at picocli.CommandLine$RunLast.execute(CommandLine.java:2421) > at picocli.CommandLine.execute(CommandLine.java:2174) > at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100) > at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91) > at > 
org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159) > Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal > configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m > (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432). > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56) > at > org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196) > at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.(XceiverServerRatis.java:214) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.(OzoneContainer.java:209) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.(DatanodeStateMachine.java:183) > at > org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291) > ... 14 more > Caused by: java.lang.IllegalArgumentException: Illegal configuration: > raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger > than raft.server.log.appender.buffer.byte-limit(= 33554432). 
> at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:184) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:152) > at org.apache.ratis.grpc.server.GrpcService.(GrpcService.java:57) > at > org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111) > at > org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133) > at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40) > at > org.apache.ratis.server.impl.RaftServerProxy.(RaftServerProxy.java:212) > at > org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74) > at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225) > at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212) > at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73) > at > org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccess
[jira] [Updated] (HDDS-11368) Remove babel dependencies from Recon
[ https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-11368: -- Fix Version/s: (was: 1.5.0) (was: 1.4.1) (was: 2.0.0) Target Version/s: 1.5.0, 1.4.1, 2.0.0 > Remove babel dependencies from Recon > > > Key: HDDS-11368 > URL: https://issues.apache.org/jira/browse/HDDS-11368 > Project: Apache Ozone > Issue Type: Task > Components: Ozone Recon >Affects Versions: 1.4.0 >Reporter: Abhishek Pal >Assignee: Abhishek Pal >Priority: Blocker > Labels: pull-request-available > > *caniuse-lite* is currently being imported as a part of babel, which is > internally used by vitejs/plugin-react. > Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be > used in our projects. > This JIRA is to track the removal of the dependency from Recon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Resolved] (RATIS-2132) Revert RATIS-2099 due to its performance regression
[ https://issues.apache.org/jira/browse/RATIS-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2132. --- Resolution: Done Reverted RATIS-2099 and RATIS-2101. Thanks, [~duongnguyen], for reporting this! {code} commit 520ecab157c9c6aff87ed5c5978c08f98cd4ec6c (origin/master, origin/HEAD, master) Author: Tsz-Wo Nicholas Sze Date: Mon Aug 26 11:07:29 2024 -0700 Revert "RATIS-2099. Cache TermIndexImpl instead of using anonymous class (#1100)" This reverts commit 428ce4ae3d5a0349f3425cb85ef1a3d38dea24b1. {code} {code} commit da5d508caffc4ca90b0ab962b5105785b9774daa Author: Tsz-Wo Nicholas Sze Date: Mon Aug 26 11:07:17 2024 -0700 Revert "RATIS-2101. Move TermIndex.PRIVATE_CACHE to Util.CACHE (#1103)" This reverts commit 93eb32a8620fdd4e5119592ef32bc50590810c7b. {code} > Revert RATIS-2099 due to its performance regression > --- > > Key: RATIS-2132 > URL: https://issues.apache.org/jira/browse/RATIS-2132 > Project: Ratis > Issue Type: Sub-task >Reporter: Duong >Assignee: Duong >Priority: Major > Attachments: Screenshot 2024-07-29 at 5.07.32 PM.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > This commit creates a significant extra cost in the critical path (which is > run sequentially) of Ratis appendTransaction. > !Screenshot 2024-07-29 at 5.07.32 PM.png|width=981,height=479! > This seems to be a premature optimization. One or two instances of TermIndex > per request are basically nothing (unless we create hundreds/thousands of > them per request). Short-lived POJOs like this are best handled by the > Java GC/heap. > More details are in the parent Jira, RATIS-2129. -- This message was sent by Atlassian Jira (v8.20.10#820010)
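The trade-off being reverted can be sketched like this. The code below is illustrative only, not the actual RATIS-2099 patch: interning immutable value objects in a shared cache puts a map lookup, with its synchronization, on the sequential hot path, whereas per-request allocation of a short-lived object is handled cheaply by the young-generation GC.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TermIndexSketch {
  // Illustrative value type standing in for Ratis' TermIndex.
  record TermIndex(long term, long index) {}

  // Cached variant (the reverted approach): every lookup pays a
  // ConcurrentHashMap access on the critical path of each request.
  static final Map<TermIndex, TermIndex> CACHE = new ConcurrentHashMap<>();

  static TermIndex cached(long term, long index) {
    return CACHE.computeIfAbsent(new TermIndex(term, index), k -> k);
  }

  // Plain variant: a short-lived object the GC reclaims cheaply.
  static TermIndex fresh(long term, long index) {
    return new TermIndex(term, index);
  }

  public static void main(String[] args) {
    // Both produce equal values; only allocation/lookup behavior differs.
    System.out.println(cached(192, 22441098598L).equals(fresh(192, 22441098598L)));
  }
}
```

The cached variant also returns a canonical instance (repeated calls yield the same object), which is exactly the property that was judged not worth its cost here.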
[jira] [Updated] (HDDS-11368) Remove babel dependencies from Recon
[ https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-11368: -- Fix Version/s: 1.5.0 1.4.1 2.0.0 Affects Version/s: (was: 1.5.0) Priority: Blocker (was: Critical) Setting this as blocker of 1.4.1 and 2.0.0 > Remove babel dependencies from Recon > > > Key: HDDS-11368 > URL: https://issues.apache.org/jira/browse/HDDS-11368 > Project: Apache Ozone > Issue Type: Task > Components: Ozone Recon >Affects Versions: 1.4.0 >Reporter: Abhishek Pal >Assignee: Abhishek Pal >Priority: Blocker > Labels: pull-request-available > Fix For: 1.5.0, 1.4.1, 2.0.0 > > > *caniuse-lite* is currently being imported as a part of babel, which is > internally used by vitejs/plugin-react. > Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be > used in our projects. > This JIRA is to track the removal of the dependency from Recon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Comment Edited] (HDDS-11368) Remove babel dependencies from Recon
[ https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876794#comment-17876794 ] Tsz-wo Sze edited comment on HDDS-11368 at 8/26/24 5:16 PM: Setting this as blocker of 1.4.1, 1.5.0 and 2.0.0 was (Author: szetszwo): Setting this as blocker of 1.4.1 and 2.0.0 > Remove babel dependencies from Recon > > > Key: HDDS-11368 > URL: https://issues.apache.org/jira/browse/HDDS-11368 > Project: Apache Ozone > Issue Type: Task > Components: Ozone Recon >Affects Versions: 1.4.0 >Reporter: Abhishek Pal >Assignee: Abhishek Pal >Priority: Blocker > Labels: pull-request-available > Fix For: 1.5.0, 1.4.1, 2.0.0 > > > *caniuse-lite* is currently being imported as a part of babel, which is > internally used by vitejs/plugin-react. > Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be > used in our projects. > This JIRA is to track the removal of the dependency from Recon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Resolved] (RATIS-2143) Off-heap memory oom issue in SegmentedRaftLogReader
[ https://issues.apache.org/jira/browse/RATIS-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2143. --- Resolution: Invalid [~weiming], thanks for checking it! Resolving ... > Off-heap memory oom issue in SegmentedRaftLogReader > --- > > Key: RATIS-2143 > URL: https://issues.apache.org/jira/browse/RATIS-2143 > Project: Ratis > Issue Type: Bug >Affects Versions: 3.0.1 >Reporter: weiming >Priority: Major > Attachments: image-2024-08-21-15-17-45-705.png, > image-2024-08-21-15-41-00-261.png, image-2024-08-22-11-26-01-822.png > > > In our Ozone cluster, a DN was found in the SCM page to be in the DEAD state. > When restarting, the DN could not start normally, and an off-heap memory OOM > was found in the log. > > ENV: > ratis version release-3.0.1 > > JDK: > openjdk 17.0.2 2022-01-18 > OpenJDK Runtime Environment (build 17.0.2+8-86) > OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing) > > Ozone DN JVM param: > {code:java} > // code placeholder > export OZONE_DATANODE_OPTS="-Xms24g -Xmx48g -Xmn16g -XX:MetaspaceSize=512m > -XX:MaxDirectMemorySize=48g -XX:+UseG1GC -XX:MaxGCPauseMillis=60 > -XX:ParallelGCThreads=32 -XX:ConcGCThreads=16 -XX:+AlwaysPreTouch > -XX:+TieredCompilation -XX:+UseStringDeduplication > -XX:+OptimizeStringConcat -XX:G1HeapRegionSize=32M > -XX:+ParallelRefProcEnabled -XX:ReservedCodeCacheSize=1024M > -XX:+UnlockExperimentalVMOptions -XX:G1MixedGCLiveThresholdPercent=85 -XX:G1HeapWastePercent=10 > -XX:InitiatingHeapOccupancyPercent=40 -XX:-G1UseAdaptiveIHOP -verbose:gc > -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExitOnOutOfMemoryError > -Dorg.apache.ratis.thirdparty.io.netty.tryReflectionSetAccessible=true > -Xlog:gc*=info:file=${OZONE_LOG_DIR}/dn_gc-%p.log:time,level,tags:filecount=50,filesize=100M > -XX:NativeMemoryTracking=detail " {code} > > ERROR LOG: > > java.lang.OutOfMemoryError: Cannot reserve 8192 bytes of direct buffer memory > (allocated: 51539599490, limit: 51539607552) > at 
java.base/java.nio.Bits.reserveMemory(Bits.java:178) > at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:121) > at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:332) > at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:243) > at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:293) > at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:273) > at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:232) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101) > at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244) > at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:343) > at java.base/java.io.FilterInputStream.read(FilterInputStream.java:132) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader$LimitedInputStream.read(SegmentedRaftLogReader.java:96) > at java.base/java.io.DataInputStream.read(DataInputStream.java:151) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.verifyHeader(SegmentedRaftLogReader.java:172) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.init(SegmentedRaftLogInputStream.java:95) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:122) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:131) > at > org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:236) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:346) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:295) > at > 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:236) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:186) > at java.base/java.lang.Thread.run(Thread.java:833) > !image-2024-08-21-15-17-45-705.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
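For context on the stack trace above: a FileChannel read into a heap ByteBuffer is copied through a temporary direct buffer of the same size (the sun.nio.ch.Util.getTemporaryDirectBuffer frame in the trace), and those temporary buffers count against -XX:MaxDirectMemorySize. The sketch below illustrates that generic JDK behavior, not the Ratis code: a fixed chunk size bounds the direct memory the JDK borrows per read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedRead {
  // Reads a file through a fixed-size heap buffer. Each read into a heap
  // ByteBuffer goes via a temporary direct buffer sized to the buffer's
  // remaining capacity, so chunkSize caps the direct memory used per read.
  static long readAll(Path path, int chunkSize) throws IOException {
    long total = 0;
    ByteBuffer buf = ByteBuffer.allocate(chunkSize); // heap buffer
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
      int n;
      while ((n = ch.read(buf)) > 0) {
        total += n;
        buf.clear();
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("chunked-read-demo", ".bin");
    Files.write(p, new byte[10_000]);
    System.out.println(readAll(p, 4096)); // prints 10000
    Files.delete(p);
  }
}
```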
[jira] [Commented] (HDDS-11360) NPE in OMRatisHelper
[ https://issues.apache.org/jira/browse/HDDS-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876326#comment-17876326 ] Tsz-wo Sze commented on HDDS-11360: --- [~jianghuazhu], we should find out why reply.getMessage() is null. > NPE in OMRatisHelper > > > Key: HDDS-11360 > URL: https://issues.apache.org/jira/browse/HDDS-11360 > Project: Apache Ozone > Issue Type: Improvement > Components: OM >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Priority: Major > > Found some NullPointerException in OMRatisHelper. Here are some cases. > om ha switch: > {code:java} > 2024-08-23 13:41:35,785 [om22-server-thread545] INFO > org.apache.ratis.server.RaftServer$Division: om22@group-61C56C563FC9: receive > transferLeadership > TransferLeadershipRequest:client-CBC5546B4108->om22@group-61C56C563FC9, > cid=3, seq=null, RO, null > 2024-08-23 13:41:35,786 [om22-server-thread545] INFO > org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: > start transferring leadership to om21 > 2024-08-23 13:41:35,787 [om22-server-thread545] INFO > org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: > sendStartLeaderElection to follower om21, lastEntry=(t:77, i:12154700362) > 2024-08-23 13:41:35,787 [om22-server-thread545] INFO > org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: > SUCCESS sent StartLeaderElection to transferee om21 immediately as it already > has up-to-date log > {code} > OMRatisHelper: > {code:java} > 2024-08-23 13:41:35,869 [IPC Server handler 113 on default port 9862] WARN > org.apache.hadoop.ipc.Server: IPC Server handler 113 on default port 9862, > call Call#8836 Retry#0 > org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from > xxx.xxx.xxx.xxx:33796 / xx.xx.xx.xx:33796 > java.lang.NullPointerException: Cannot invoke > "org.apache.ratis.protocol.Message.getContent()" because the return value of > 
"org.apache.ratis.protocol.RaftClientReply.getMessage()" is null > at > org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:66) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:524) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$1(OzoneManagerRatisServer.java:279) > at > org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:277) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:257) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:257) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:236) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:172) > at > org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:163) > at > org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1098) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1021) > at > java.base/java.security.AccessController.doPrivileged(AccessController.java:712) > at java.base/javax.security.auth.Subject.doAs(Subject.java:439) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) > {code} > s3gateway log: > {code:java} > 2024-08-23 13:41:35,801 [qtp1396431506-4981] INFO > org.apache.hadoop.io.retry.RetryInvocationHandler: > com.google.protobuf.ServiceException: > org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): Cannot > invoke "org.apache.ratis.protocol.Message.getCon
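Until the root cause of the null message is found, the failure mode itself is visible in the stack trace: OMRatisHelper dereferences reply.getMessage() without a null check. A hypothetical defensive guard is sketched below; the interfaces are stand-ins for the real org.apache.ratis.protocol types, and this is not the actual Ozone fix, just an illustration of turning the NPE into a descriptive error:

```java
import java.io.IOException;

public class ReplyGuardSketch {
  // Minimal stand-ins for org.apache.ratis.protocol.Message and
  // org.apache.ratis.protocol.RaftClientReply (assumed shapes).
  interface Message { byte[] getContent(); }
  interface RaftClientReply {
    Message getMessage();
    boolean isSuccess();
  }

  // Instead of reply.getMessage().getContent() (the NPE site in the stack
  // trace), fail with an exception that records the reply state, which also
  // helps answer why the message was null.
  static byte[] contentOf(RaftClientReply reply) throws IOException {
    Message message = reply.getMessage();
    if (message == null) {
      throw new IOException(
          "RaftClientReply has no message (success=" + reply.isSuccess() + ")");
    }
    return message.getContent();
  }
}
```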
[jira] [Resolved] (RATIS-2113) Use consistent method names and parameter types in RaftUtils
[ https://issues.apache.org/jira/browse/RATIS-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2113. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. > Use consistent method names and parameter types in RaftUtils > - > > Key: RATIS-2113 > URL: https://issues.apache.org/jira/browse/RATIS-2113 > Project: Ratis > Issue Type: Improvement > Components: shell >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Fix For: 3.2.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Since RaftUtils is going to be a public API, we should: > - Use consistent method names: getPeerId vs buildRaftPeersFromStr. > - Use consistent parameter types: PrintStream vs Consumer > - Remove duplicated AbstractCommand.parseInetSocketAddress -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2140) Thread wait when installing snapshot
[ https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2140. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~z-bb]! > Thread wait when installing snapshot > > > Key: RATIS-2140 > URL: https://issues.apache.org/jira/browse/RATIS-2140 > Project: Ratis > Issue Type: Bug > Components: gRPC >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Assignee: guangbao zhao >Priority: Major > Fix For: 3.2.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Hi [~szetszwo], I found a problem. In our service, when the leader notifies > the follower to install a snapshot, a timing issue may leave the > GrpcLogAppender thread in the wait state, causing the snapshot installation > to fail; the follower then does not receive the leader's heartbeat within > the specified timeout and triggers an election. > The last log that triggered the exception > node1: > {code:java} > 2024/08/17 19:36:19,068 > [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with > firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183 > 2024/08/17 19:36:19,068 > [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-GrpcLogAppender: send > node1->node2#0-t139,notify:(t:138, i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a > reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0 > 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
InstallSnapshot in progress. > 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-AppendLogResponseHandler: received > INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, > request=AppendEntriesRequest:cid=11690239,entriesCount=0 > 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: > receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, > i:16857386182)) > 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO > org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: > reject PRE_VOTE from node2: this server is the leader and still has > leadership > ...{code} > node2: > {code:java} > 2024/08/17 19:36:19,068 [node2-server-thread482] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed > appendEntries as snapshot (17159569079) installation is in progress > 2024/08/17 19:36:19,068 [node2-server-thread482] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: > inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1 > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.server.impl.SnapshotInstallationHandler: > node2@group-4F53D3317400: receive installSnapshot: > node1->node2#0-t139,notify:(t:138, i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.server.impl.SnapshotInstallationHandler: > node2@group-4F53D3317400: reply installSnapshot: > node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0 > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed > INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, > i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed > INSTALL_SNAPSHOT, lastReply: null > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.impl.FollowerState: > node2@group-4F53D3317400-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.impl.RoleInfo: node2: shutdown > node2@group-4F53D3317400-FollowerState > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: > changes role from FOLLOWER to CANDIDATE at term 139 for chan
[jira] [Assigned] (RATIS-2140) Thread wait when installing snapshot
[ https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2140: - Assignee: guangbao zhao
> Thread wait when installing snapshot
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
> Issue Type: Bug
> Components: gRPC
> Affects Versions: 3.0.1
> Reporter: guangbao zhao
> Assignee: guangbao zhao
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Resolved] (RATIS-2144) SegmentedRaftLogWorker should close the stream before releasing the buffer.
[ https://issues.apache.org/jira/browse/RATIS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2144. --- Fix Version/s: 3.2.0 Assignee: Xinyu Tan Resolution: Fixed The pull request is now merged. [~tanxinyu], thanks for the quick fix!
> SegmentedRaftLogWorker should close the stream before releasing the buffer.
>
> Key: RATIS-2144
> URL: https://issues.apache.org/jira/browse/RATIS-2144
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Tsz-wo Sze
> Assignee: Xinyu Tan
> Priority: Major
> Fix For: 3.2.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In the code below, close() frees the buffer first and only then cleans up out. The freed buffer's content can be corrupted and then flushed to out.
> {code}
> //SegmentedRaftLogWorker
> void close() {
>   ...
>   PlatformDependent.freeDirectBuffer(writeBuffer);
>   IOUtils.cleanup(LOG, out);
>   LOG.info("{} close()", name);
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
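The corrected ordering can be sketched as follows. This is a minimal, hypothetical reconstruction, not the actual Ratis fix: the class name is invented, the writeBuffer/out fields only mirror the snippet above, and a heap-backed stream stands in for the real segment file. The point is that close() must drain and close the stream while the buffer is still valid memory, and release the buffer last.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the fixed close() ordering: stream first, buffer last.
public class LogWorkerCloseSketch {
    private final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64);
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    void append(String s) {
        writeBuffer.put(s.getBytes(StandardCharsets.UTF_8));
    }

    String close() throws IOException {
        // 1. Drain pending bytes while the direct buffer is still valid.
        writeBuffer.flip();
        byte[] pending = new byte[writeBuffer.remaining()];
        writeBuffer.get(pending);
        out.write(pending);
        out.close();
        // 2. Only now release the direct buffer (Ratis would call
        //    PlatformDependent.freeDirectBuffer(writeBuffer) here; this
        //    sketch simply lets the reference go out of scope).
        return out.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        LogWorkerCloseSketch w = new LogWorkerCloseSketch();
        w.append("entry-1");
        System.out.println(w.close());  // prints entry-1
    }
}
```

In the buggy ordering, a concurrent flush between the free and the cleanup can write reclaimed memory into the segment file, which matches the checksum corruption seen in HDDS-11352.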
[jira] [Updated] (RATIS-2140) Thread wait when installing snapshot
[ https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2140: -- Component/s: gRPC
> Thread wait when installing snapshot
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
> Issue Type: Bug
> Components: gRPC
> Affects Versions: 3.0.1
> Reporter: guangbao zhao
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
[ https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875681#comment-17875681 ] Tsz-wo Sze commented on HDDS-11352: --- Filed RATIS-2144. > Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes > -- > > Key: HDDS-11352 > URL: https://issues.apache.org/jira/browse/HDDS-11352 > Project: Apache Ozone > Issue Type: Sub-task > Components: Ozone Manager >Reporter: Ethan Rose >Priority: Critical > Attachments: it-om.zip > > > Failure observed in [this > run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] > in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be > specific to that test in particular. > {code} > --- > Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes > --- > Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s > <<< FAILURE! - in > org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes > org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown Time > elapsed: 18.461 s <<< ERROR! > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > omNode-1@group-523986131536: Failed to initRaftLog. 
> at > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) > at > java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) > at > java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498) > at > java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219) > at > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) > at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) > at > org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206) > at > org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:840) > Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: > Failed to initRaftLog. > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171) > at > org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131) > at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63) > at > org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148) > at > org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385) > at > org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203) > ... 4 more > Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry > corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C. 
> at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:172) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:428) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:258) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:231) > at > org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:273) > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:194) > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:168) > ... 9 more > org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.testListVolumes > Time elapsed: 121.075 s <<< ERROR! > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e
[jira] [Comment Edited] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
[ https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678 ] Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 11:00 PM: -
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() in two different threads at about the same time. In the code below, close() frees the buffer first and only then cleans up out. The freed buffer's content can be corrupted and then flushed to out.
{code}
//SegmentedRaftLogWorker
void close() {
  ...
  PlatformDependent.freeDirectBuffer(writeBuffer);
  IOUtils.cleanup(LOG, out);
  LOG.info("{} close()", name);
}
{code}
was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() in two different threads at about the same time. In the code below, close() frees the buffer first and only then cleans up out. The freed buffer's content can be corrupted and then flushed to out. This is a recent change from RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
  ...
  PlatformDependent.freeDirectBuffer(writeBuffer);
  IOUtils.cleanup(LOG, out);
  LOG.info("{} close()", name);
}
{code}
> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Ethan Rose
> Priority: Critical
> Attachments: it-om.zip
[jira] [Commented] (RATIS-2065) Avoid the out-of-heap memory OOM phenomenon of frequent creation and deletion of Raft group scenarios
[ https://issues.apache.org/jira/browse/RATIS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875680#comment-17875680 ] Tsz-wo Sze commented on RATIS-2065: --- [~tanxinyu], it looks like there is a bug; see RATIS-2144.
> Avoid the out-of-heap memory OOM phenomenon of frequent creation and deletion of Raft group scenarios
>
> Key: RATIS-2065
> URL: https://issues.apache.org/jira/browse/RATIS-2065
> Project: Ratis
> Issue Type: Improvement
> Components: server
> Reporter: Xinyu Tan
> Assignee: Xinyu Tan
> Priority: Major
> Fix For: 3.1.0
> Time Spent: 40m
> Remaining Estimate: 0h
>
> The current SegmentedRaftLogWorker creates a [DirectBuffer|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L209] when it is constructed but never explicitly releases it. The corresponding off-heap memory can only be freed after the in-heap object has been reclaimed by the GC.
> In scenarios with frequent Raft group creation and deletion, there may be plenty of free in-heap memory, so no GC is triggered, while the off-heap memory remains occupied by these discarded DirectBuffers; an off-heap OOM eventually occurs.
> In IoTDB, we [explicitly|https://github.com/apache/iotdb/blob/master/iotdb-core/datanode/src/main/java/org/apache/iotdb/db/utils/MmapUtil.java#L33] release the off-heap memory, avoiding a similar situation.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
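The explicit release described above can be sketched like this. This is a hedged approximation, not IoTDB's or Ratis's actual code: Ratis wraps the release behind Netty's PlatformDependent.freeDirectBuffer, and IoTDB's MmapUtil reaches the buffer's cleaner via reflection; the sketch below instead uses sun.misc.Unsafe.invokeCleaner, which exists on Java 9 and later.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Hypothetical helper: free a direct buffer's off-heap memory immediately,
// instead of waiting for the GC to collect the heap-side DirectByteBuffer.
public final class DirectBufferFree {
    private DirectBufferFree() {}

    public static boolean free(ByteBuffer buffer) {
        if (buffer == null || !buffer.isDirect()) {
            return false; // heap buffers are reclaimed by the normal GC path
        }
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            ((Unsafe) f.get(null)).invokeCleaner(buffer); // frees now, no GC needed
            return true;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot free direct buffer", e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocateDirect(1 << 20); // 1 MB off-heap
        System.out.println(free(b)); // prints true
    }
}
```

After free(buffer) returns, the buffer must not be touched again; any read or write would access reclaimed memory, which is exactly the hazard RATIS-2144 describes for the close() ordering.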
[jira] [Created] (RATIS-2144) SegmentedRaftLogWorker should close the stream before releasing the buffer.
Tsz-wo Sze created RATIS-2144: - Summary: SegmentedRaftLogWorker should close the stream before releasing the buffer.
Key: RATIS-2144
URL: https://issues.apache.org/jira/browse/RATIS-2144
Project: Ratis
Issue Type: Bug
Components: server
Reporter: Tsz-wo Sze
In the code below, close() frees the buffer first and only then cleans up out. The freed buffer's content can be corrupted and then flushed to out.
{code}
//SegmentedRaftLogWorker
void close() {
  ...
  PlatformDependent.freeDirectBuffer(writeBuffer);
  IOUtils.cleanup(LOG, out);
  LOG.info("{} close()", name);
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
[ https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678 ] Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 10:49 PM: -
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() in two different threads at about the same time. In the code below, close() frees the buffer first and only then cleans up out. The freed buffer's content can be corrupted and then flushed to out. This is a recent change from RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
  ...
  PlatformDependent.freeDirectBuffer(writeBuffer);
  IOUtils.cleanup(LOG, out);
  LOG.info("{} close()", name);
}
{code}
was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() in two different threads at about the same time. In the code below, close() frees the buffer before closing out. The buffer content can be corrupted. This is a recent change from RATIS-2065.
{code}
//SegmentedRaftLogWorker
void close() {
  ...
  PlatformDependent.freeDirectBuffer(writeBuffer);
  IOUtils.cleanup(LOG, out);
  LOG.info("{} close()", name);
}
{code}
> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Ethan Rose
> Priority: Critical
> Attachments: it-om.zip
[jira] [Commented] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
[ https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678 ] Tsz-wo Sze commented on HDDS-11352: --- {code} 2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:execute(637)) - omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107 2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - omNode-1@group-523986131536-SegmentedRaftLogWorker close() {code} In the log, SegmentedRaftLogWorker created new log segment and calling close() in two different threads about the same time. Checked the code below, it frees the buffer first and call close(). The buffer content can be corrupted. It is recent change by RATIS-2065. {code} //SegmentedRaftLogWorker void close() { ... PlatformDependent.freeDirectBuffer(writeBuffer); IOUtils.cleanup(LOG, out); LOG.info("{} close()", name); } {code} > Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes > -- > > Key: HDDS-11352 > URL: https://issues.apache.org/jira/browse/HDDS-11352 > Project: Apache Ozone > Issue Type: Sub-task > Components: Ozone Manager >Reporter: Ethan Rose >Priority: Critical > Attachments: it-om.zip > > > Failure observed in [this > run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] > in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be > specific to that test in particular. > {code} > --- > Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes > --- > Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s > <<< FAILURE! 
- in > org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes > org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown Time > elapsed: 18.461 s <<< ERROR! > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > omNode-1@group-523986131536: Failed to initRaftLog. > at > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) > at > java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) > at > java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498) > at > java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219) > at > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) > at > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) > at > org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206) > at > org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:840) > Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: > Failed to initRaftLog. > at > org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171) > at > org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131) > at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63) > at > org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148) > at > org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385) > at > org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203) > ... 
4 more > Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry > corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C. > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131) > at > or
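The fix implied by the comment above is an ordering and synchronization change: close the output stream while the write buffer is still valid, free the buffer only afterwards, and serialize close() against in-flight writes so the two threads seen in the log cannot interleave. A minimal sketch of that idea, with purely illustrative names (this is not the actual Ratis SegmentedRaftLogWorker code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the safe shutdown ordering: (1) close the stream while the
// buffer it writes from is still valid, (2) only then free the buffer,
// with close() synchronized against writers so they cannot interleave.
class SafeCloseWorker {
    private final List<String> events = new ArrayList<>();
    private boolean closed = false;

    synchronized void write(String entry) {
        if (closed) {
            throw new IllegalStateException("write after close");
        }
        events.add("write:" + entry);
    }

    synchronized void close() {
        if (closed) {
            return; // idempotent close
        }
        events.add("closeStream"); // 1. flush/close out while buffer is valid
        events.add("freeBuffer");  // 2. free the direct buffer last
        closed = true;
    }

    synchronized List<String> events() {
        return events;
    }
}
```

Because both methods are synchronized on the worker, a concurrent writer either completes before close() runs or fails fast afterwards; the buffer can never be freed under an in-progress write.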
[jira] [Updated] (RATIS-2142) OOM for stateMachineCache use cases
[ https://issues.apache.org/jira/browse/RATIS-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2142: -- Summary: OOM for stateMachineCache use cases (was: Memory leak for stateMachineCache use cases) > OOM for stateMachineCache use cases > --- > > Key: RATIS-2142 > URL: https://issues.apache.org/jira/browse/RATIS-2142 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Priority: Major > > In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a > reference to the original RaftClientRequest. This is not supposed to happen > as RaftLogCache entries should only refer to the LogEntries with data > truncated. > This problem impacts Apache Ozone. The reference from RaftLogCache entries > prevents the original RaftClientRequest (which contains a large data chunk) from > being GCed in a timely manner. As a result, Ozone datanodes quickly run out of > heap memory. > This is not the case with the latest master branch, only with the 3.1.0 > release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (RATIS-2142) OOM for stateMachineCache use cases
[ https://issues.apache.org/jira/browse/RATIS-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2142. --- Resolution: Duplicate Resolving this as a duplicate of RATIS-2141. > OOM for stateMachineCache use cases > --- > > Key: RATIS-2142 > URL: https://issues.apache.org/jira/browse/RATIS-2142 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Priority: Major > > In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a > reference to the original RaftClientRequest. This is not supposed to happen > as RaftLogCache entries should only refer to the LogEntries with data > truncated. > This problem impacts Apache Ozone. The reference from RaftLogCache entries > prevents the original RaftClientRequest (which contains a large data chunk) from > being GCed in a timely manner. As a result, Ozone datanodes quickly run out of > heap memory. > This is not the case with the latest master branch, only with the 3.1.0 > release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (RATIS-2141) OOM for stateMachineCache use cases
[ https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2141: -- Summary: OOM for stateMachineCache use cases (was: Memory leak for stateMachineCache use cases) [~duongnguyen], thanks for finding out this problem! "Memory leak" usually means that memory was allocated but not released; see - https://en.wikipedia.org/wiki/Memory_leak In this case, we do not have such a problem; our problem is unnecessarily using too much memory. Updating the Summary. > OOM for stateMachineCache use cases > --- > > Key: RATIS-2141 > URL: https://issues.apache.org/jira/browse/RATIS-2141 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Priority: Major > Attachments: RaftLogCache_entry.png, heap-dump.png > > > In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a > reference to the original RaftClientRequest. This is not supposed to happen > as RaftLogCache entries should only refer to the LogEntries with data > truncated, and RaftLogCache retention policy only counts the size of the > entries without data. > This problem impacts Apache Ozone. The reference from RaftLogCache entries > prevents the original RaftClientRequest (which contains a large data chunk) from > being GCed. As a result, Ozone datanodes quickly run out of heap memory. > !heap-dump.png|width=1286,height=141! > !RaftLogCache_entry.png|width=730,height=272! > This is not the case with the latest master branch, only with the 3.1.0 > release. > The fix for this issue in 3.1.0 is as simple as > [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
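The invariant violated in the issue above is that a log cache should retain only a truncated copy of each entry (index/term metadata, no data chunk), so the cache never pins the request holding the large payload and it remains eligible for GC. A hypothetical sketch of that rule (illustrative names, not the real Ratis API):

```java
import java.util.ArrayList;
import java.util.List;

// An entry as cached: once truncated, it keeps metadata but no payload.
class CachedEntry {
    final long index;
    final byte[] data; // null once the payload has been dropped

    CachedEntry(long index, byte[] data) {
        this.index = index;
        this.data = data;
    }

    CachedEntry truncated() {
        return new CachedEntry(index, null); // keep metadata, drop payload
    }
}

// The cache truncates on append, so it never holds a reference that would
// keep the original request's large data chunk alive.
class TruncatingLogCache {
    private final List<CachedEntry> entries = new ArrayList<>();

    void append(CachedEntry e) {
        entries.add(e.truncated());
    }

    CachedEntry get(int i) {
        return entries.get(i);
    }
}
```

With this rule, a retention policy that counts only the size of cached entries is also accurate, since the cached size no longer hides a reachable payload.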
[jira] [Comment Edited] (RATIS-2141) Memory leak for stateMachineCache use cases
[ https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875564#comment-17875564 ] Tsz-wo Sze edited comment on RATIS-2141 at 8/21/24 4:25 PM: Let's revert RATIS-1983 from 3.1.0. I have just tried the revert; it only has some minor conflicts. was (Author: szetszwo): Let's revert RATIS-1983. I just have tried the revert. It only has some minor conflicts. > Memory leak for stateMachineCache use cases > --- > > Key: RATIS-2141 > URL: https://issues.apache.org/jira/browse/RATIS-2141 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Priority: Major > Attachments: RaftLogCache_entry.png, heap-dump.png > > > In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a > reference to the original RaftClientRequest. This is not supposed to happen > as RaftLogCache entries should only refer to the LogEntries with data > truncated, and RaftLogCache retention policy only counts the size of the > entries without data. > This problem impacts Apache Ozone. The reference from RaftLogCache entries > prevents the original RaftClientRequest (which contains a large data chunk) from > being GCed. As a result, Ozone datanodes quickly run out of heap memory. > !heap-dump.png|width=1286,height=141! > !RaftLogCache_entry.png|width=730,height=272! > This is not the case with the latest master branch, only with the 3.1.0 > release. > The fix for this issue in 3.1.0 is as simple as > [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (RATIS-2141) Memory leak for stateMachineCache use cases
[ https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875564#comment-17875564 ] Tsz-wo Sze commented on RATIS-2141: --- Let's revert RATIS-1983. I have just tried the revert; it only has some minor conflicts. > Memory leak for stateMachineCache use cases > --- > > Key: RATIS-2141 > URL: https://issues.apache.org/jira/browse/RATIS-2141 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Duong >Priority: Major > Attachments: RaftLogCache_entry.png, heap-dump.png > > > In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a > reference to the original RaftClientRequest. This is not supposed to happen > as RaftLogCache entries should only refer to the LogEntries with data > truncated, and RaftLogCache retention policy only counts the size of the > entries without data. > This problem impacts Apache Ozone. The reference from RaftLogCache entries > prevents the original RaftClientRequest (which contains a large data chunk) from > being GCed. As a result, Ozone datanodes quickly run out of heap memory. > !heap-dump.png|width=1286,height=141! > !RaftLogCache_entry.png|width=730,height=272! > This is not the case with the latest master branch, only with the 3.1.0 > release. > The fix for this issue in 3.1.0 is as simple as > [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (RATIS-2143) Off-heap memory oom issue in SegmentedRaftLogReader
[ https://issues.apache.org/jira/browse/RATIS-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875562#comment-17875562 ] Tsz-wo Sze commented on RATIS-2143: --- How about the number of pipelines? We have seen cases where there were hundreds of pipelines in a datanode that were never cleaned up, which caused an OOM. See Duong's reply in https://lists.apache.org/thread/dpo6tjmxy1n9gmc67jjjm7pon8txfyjb > Off-heap memory oom issue in SegmentedRaftLogReader > --- > > Key: RATIS-2143 > URL: https://issues.apache.org/jira/browse/RATIS-2143 > Project: Ratis > Issue Type: Bug >Affects Versions: 3.0.1 >Reporter: weiming >Priority: Major > Attachments: image-2024-08-21-15-17-45-705.png, > image-2024-08-21-15-41-00-261.png, image-2024-08-21-15-43-24-729.png > > > In our ozone cluster, a DN was found in the SCM page to be in the DEAD state. > When restarting, the DN could not start normally, and an off-heap memory OOM > was found in the log. > > ENV: > ratis version release-3.0.1 > > JDK: > openjdk 17.0.2 2022-01-18 > OpenJDK Runtime Environment (build 17.0.2+8-86) > OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing) > > Ozone DN JVM param: > {code:java} > // code placeholder > export OZONE_DATANODE_OPTS="-Xms24g -Xmx48g -Xmn16g -XX:MetaspaceSize=512m > -XX:MaxDirectMemorySize=48g -XX:+UseG1GC -XX:MaxGCPauseMillis=60 > -XX:ParallelGCThreads=32 -XX:ConcGCThreads=16 -XX:+AlwaysPreTouch > -XX:+TieredCompilation -XX:+UseStringDeduplication > -XX:+OptimizeStringConcat -XX:G1HeapRegionSize=32M > -XX:+ParallelRefProcEnabled -XX:ReservedCodeCacheSize=1024M > -XX:+UnlockExperimentalVMOptions -XX:G1MixedGCLiveThresholdPercent=85 -XX:G1HeapWastePercent=10 > -XX:InitiatingHeapOccupancyPercent=40 -XX:-G1UseAdaptiveIHOP -verbose:gc > -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExitOnOutOfMemoryError -Dorg.apache.ratis.thirdparty.io.netty.tryReflectionSetAccessible=true > 
-Xlog:gc*=info:file=${OZONE_LOG_DIR}/dn_gc-%p.log:time,level,tags:filecount=50,filesize=100M > -XX:NativeMemoryTracking=detail " {code} > > ERROR LOG: > > java.lang.OutOfMemoryError: Cannot reserve 8192 bytes of direct buffer memory > (allocated: 51539599490, limit: 51539607552) > at java.base/java.nio.Bits.reserveMemory(Bits.java:178) > at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:121) > at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:332) > at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:243) > at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:293) > at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:273) > at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:232) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107) > at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101) > at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244) > at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:343) > at java.base/java.io.FilterInputStream.read(FilterInputStream.java:132) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader$LimitedInputStream.read(SegmentedRaftLogReader.java:96) > at java.base/java.io.DataInputStream.read(DataInputStream.java:151) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.verifyHeader(SegmentedRaftLogReader.java:172) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.init(SegmentedRaftLogInputStream.java:95) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:122) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:131) > at > 
org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:236) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:346) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:295) > at > org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:236) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:186) > at java.base/java.lang.Thread.run(Thread.java:833) > !image-2024-08-21-15-17-45-705.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
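In the stack trace above, each read through a heap array makes the JDK borrow a temporary direct buffer (`Util.getTemporaryDirectBuffer`), and with many concurrent readers this can exhaust `MaxDirectMemorySize`. One general mitigation pattern, shown below purely as an illustration and not as the actual Ratis fix, is to read through a single reused direct buffer so each reader's off-heap footprint stays bounded:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Illustrative sketch: one caller-owned direct buffer is reused for every
// read, instead of letting each heap-array read borrow a fresh temporary
// direct buffer inside the JDK.
class BoundedDirectReader {
    private final ByteBuffer buf = ByteBuffer.allocateDirect(8192);

    long drain(ReadableByteChannel ch) {
        long total = 0;
        try {
            int n;
            while ((n = ch.read(buf)) != -1) { // same direct buffer every read
                total += n;
                buf.clear(); // discard the bytes; we only count them here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The design point is that off-heap usage is now proportional to the number of reader objects, not the number of reads.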
[jira] [Comment Edited] (RATIS-2116) Follower state synchronization is blocked
[ https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875551#comment-17875551 ] Tsz-wo Sze edited comment on RATIS-2116 at 8/21/24 4:16 PM: bq. Curiosity about the Ratis community version policy, where can I find the currently supported feature versions? You may search for the "Fix Version/s" to find the JIRAs, or use git to generate the diff between two releases. bq. will such a bug fix be backported to lower versions? If there is a need, we definitely can backport bug fixes. If you are interested in a bug fix release for an older version, please feel free to let us know. was (Author: szetszwo): > Curiosity about the Ratis community version policy, where can I find the > currently supported feature versions? You may search for the "Fix Version/s" to find out the JIRAs or git to generate the diff between two releases. > will such a bug fix be backported to lower versions? If there is a need, we definitely can back port bug fixes. > Follower state synchronization is blocked > - > > Key: RATIS-2116 > URL: https://issues.apache.org/jira/browse/RATIS-2116 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.0.0, 2.5.1, 3.0.1 >Reporter: Haibo Sun >Assignee: Haibo Sun >Priority: Major > Fix For: 3.2.0 > > Attachments: debug.log > > > Using version 2.5.1, we have discovered that in some cases, the state > synchronization of the follower will be permanently blocked. > Scenario: When the task queue of the SegmentedRaftLogWorker is the pattern > (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog of > RaftServerImpl.appendEntries does not immediately flush data and complete the > result future, because there is a pending PurgeLog task in the queue. It > enqueues the result future to be completed after the latter WriteLog flushes > data. 
However, the "nioEventLoopGroup-3-1" thread is already blocked, and > will not add new WriteLog to the task queue of SegmentedRaftLogWorker. This > leads to a deadlock and causes the state synchronization to stop. > I confirmed this by adding debug logs, detailed information is attached > below. This issue can be easily reproduced by increasing the frequency of > TakeSnapshot and PurgeLog operations. In addition, after checking the code in > the master branch, this issue still exists. > > *jstack:* > {code:java} > "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x7fc58400b800 > nid=0x5493a waiting on condition [0x7fc5b4f28000] java.lang.Thread.State: > WAITING (parking) at sun.misc.Unsafe.park0(Native Method) parking to wait for > <0x7fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller) at > sun.misc.Unsafe.park(Unsafe.java:1025) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:176) at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379) > at > org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649) > at > org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231) > at > org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95) > at > org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91) > at > org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > 
org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) > at > org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext
[jira] [Commented] (RATIS-2140) Thread wait when installing snapshot
[ https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875559#comment-17875559 ] Tsz-wo Sze commented on RATIS-2140: --- We probably should always add a timeout when calling await(). > Thread wait when installing snapshot > > > Key: RATIS-2140 > URL: https://issues.apache.org/jira/browse/RATIS-2140 > Project: Ratis > Issue Type: Bug >Affects Versions: 3.0.1 >Reporter: guangbao zhao >Priority: Major > > hi, [~szetszwo] I found a problem. In our service, when the leader notifies the > follower of InstallSnapshot, timing issues may leave the GrpcLogAppender thread > in the wait state. This causes the snapshot installation to fail, and the follower, > not receiving the leader's heartbeat within the specified timeout period, triggers > an election. > The last logs before the exception: > node1: > {code:java} > 2024/08/17 19:36:19,068 > [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with > firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183 > 2024/08/17 19:36:19,068 > [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-GrpcLogAppender: send > node1->node2#0-t139,notify:(t:138, i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a > reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0 > 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: > InstallSnapshot in progress. 
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN > org.apache.ratis.grpc.server.GrpcLogAppender: > node1@group-4F53D3317400->node2-AppendLogResponseHandler: received > INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, > request=AppendEntriesRequest:cid=11690239,entriesCount=0 > 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO > org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: > receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, > i:16857386182)) > 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO > org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: > reject PRE_VOTE from node2: this server is the leader and still has > leadership > ...{code} > node2: > {code:java} > 2024/08/17 19:36:19,068 [node2-server-thread482] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed > appendEntries as snapshot (17159569079) installation is in progress > 2024/08/17 19:36:19,068 [node2-server-thread482] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: > inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1 > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.server.impl.SnapshotInstallationHandler: > node2@group-4F53D3317400: receive installSnapshot: > node1->node2#0-t139,notify:(t:138, i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.server.impl.SnapshotInstallationHandler: > node2@group-4F53D3317400: reply installSnapshot: > node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0 > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed > INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, > i:17159569079) > 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO > org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed > INSTALL_SNAPSHOT, lastReply: null > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.impl.FollowerState: > node2@group-4F53D3317400-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.impl.RoleInfo: node2: shutdown > node2@group-4F53D3317400-FollowerState > 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO > org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: > changes role from FOLLOWER to CANDIDATE at term 139 for changeToCandidate > ...{code} > node2 grpc thread stack: > {code:java} > jstack 118659 | grep -A 12 > node2-GrpcLogAppender-LogAppender
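The "always add a timeout when calling await()" suggestion in the comment above means a bounded wait: a missed notification then degrades into a re-check and retry instead of a permanently waiting appender thread. A minimal sketch, using a `CountDownLatch` as an illustrative stand-in for the real snapshot-install wait:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Bounded wait: returns true if the awaited event happened, false on
// timeout so the caller can re-check state (leadership, follower
// progress, install status) and retry instead of blocking forever.
class BoundedAwait {
    static boolean await(CountDownLatch done, long timeoutMs) {
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // treat interruption like a timeout; caller retries
        }
    }
}
```

A caller would loop on `BoundedAwait.await(...)`, re-validating its preconditions on each false return, which is exactly the pattern that avoids the stuck GrpcLogAppender thread described in the issue.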
[jira] [Resolved] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-2137. --- Fix Version/s: 3.2.0 Resolution: Fixed The pull request is now merged. Thanks, [~lemony]! > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Assignee: Kevin Liu >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2024-08-13-11-28-16-250.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > *I found that after the following log, the follower became unavailable. The > follower received incorrect entries repeatedly for about 10 minutes, then > installSnapshot failed and it started an election. After two hours, it succeeded > in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it > repeated 'receive installSnapshot and installSnapshot failed' for several > hours until I restarted the server.* > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > *(repeat 'Failed appendEntries')* > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0 > 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An > exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. 
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to > CANDIDATE at term 59 for changeToCandidate > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] > RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default) > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > start 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at > term 59 for PRE_VOTE > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit > vote requests at term 59 for 34233595: > peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], > old=nul
[jira] [Resolved] (HDDS-11331) Fix Datanode unable to report for a long time
[ https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDDS-11331. --- Fix Version/s: 1.5.0 Resolution: Fixed The pull request is now merged. Thanks, [~jianghuazhu]! > Fix Datanode unable to report for a long time > - > > Key: HDDS-11331 > URL: https://issues.apache.org/jira/browse/HDDS-11331 > Project: Apache Ozone > Issue Type: Improvement > Components: DN >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Fix For: 1.5.0 > > Attachments: 1505js.1, 7090_review.patch, screenshot-1.png, > screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png > > > SCM shows that some Datanodes cannot report for a long time, and their status > is DEAD or STALE. > I printed jstack information, which shows that StateContext#pipelineActions > is stuck and cannot report to SCM/Recon. > The jstack information has been uploaded as an attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Commented] (RATIS-2116) Follower state synchronization is blocked
[ https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875551#comment-17875551 ] Tsz-wo Sze commented on RATIS-2116: --- > Curiosity about the Ratis community version policy, where can I find the > currently supported feature versions? You may search for the "Fix Version/s" to find the JIRAs, or use git to generate the diff between two releases. > will such a bug fix be backported to lower versions? If there is a need, we definitely can backport bug fixes. > Follower state synchronization is blocked > - > > Key: RATIS-2116 > URL: https://issues.apache.org/jira/browse/RATIS-2116 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 3.0.0, 2.5.1, 3.0.1 >Reporter: Haibo Sun >Assignee: Haibo Sun >Priority: Major > Fix For: 3.2.0 > > Attachments: debug.log > > > Using version 2.5.1, we have discovered that in some cases, the state > synchronization of the follower will be permanently blocked. > Scenario: When the task queue of the SegmentedRaftLogWorker is the pattern > (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog of > RaftServerImpl.appendEntries does not immediately flush data and complete the > result future, because there is a pending PurgeLog task in the queue. It > enqueues the result future to be completed after the latter WriteLog flushes > data. However, the "nioEventLoopGroup-3-1" thread is already blocked, and > will not add new WriteLog to the task queue of SegmentedRaftLogWorker. This > leads to a deadlock and causes the state synchronization to stop. > I confirmed this by adding debug logs, detailed information is attached > below. This issue can be easily reproduced by increasing the frequency of > TakeSnapshot and PurgeLog operations. In addition, after checking the code in > the master branch, this issue still exists. 
> > *jstack:* > {code:java} > "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x7fc58400b800 > nid=0x5493a waiting on condition [0x7fc5b4f28000] java.lang.Thread.State: > WAITING (parking) at sun.misc.Unsafe.park0(Native Method) parking to wait for > <0x7fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller) at > sun.misc.Unsafe.park(Unsafe.java:1025) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:176) at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379) > at > org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649) > at > org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231) > at > org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95) > at > org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91) > at > org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) > at > org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > 
org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) > at > org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) > at > org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) > at > org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelH
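The blocking pattern described in RATIS-2116 can be reduced to a few lines. The sketch below is a hypothetical simplification (string tasks, one worker thread), not the actual SegmentedRaftLogWorker code: when a purge is pending behind a write, the write's result future is deferred until a later WriteLog arrives, but the only thread that could enqueue that later WriteLog is the one parked on the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeferredFlushDeadlock {
  public static void main(String[] args) throws Exception {
    LinkedBlockingQueue<String> tasks = new LinkedBlockingQueue<>();
    CompletableFuture<Void> writeDone = new CompletableFuture<>();

    // Queue pattern from the report: (WriteLog, ..., PurgeLog).
    tasks.put("WriteLog");
    tasks.put("PurgeLog");

    Thread worker = new Thread(() -> {
      try {
        while (true) {
          String t = tasks.take();
          if (t.equals("WriteLog")) {
            if (tasks.contains("PurgeLog")) {
              // A purge is pending: defer the flush, parking writeDone until
              // a LATER WriteLog arrives -- which never happens, because the
              // only producer is blocked below.
            } else {
              writeDone.complete(null); // normal path: flush completes the future
            }
          }
          // "PurgeLog" is simply consumed in this sketch.
        }
      } catch (InterruptedException ignored) {
      }
    });
    worker.setDaemon(true);
    worker.start();

    // This thread plays "nioEventLoopGroup-3-1": RaftServerImpl.appendEntries
    // blocks on the future, so it can never enqueue the next WriteLog.
    boolean blocked = false;
    try {
      writeDone.get(300, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      blocked = true;
    }
    System.out.println(blocked ? "appendEntries blocked" : "appendEntries completed");
  }
}
```

Here the timeout only exists so the demo terminates; in the real server the join() has no timeout, which is why the follower stalls permanently.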
[jira] [Resolved] (HDFS-17606) Do not require implementing CustomizedCallbackHandler
[ https://issues.apache.org/jira/browse/HDFS-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-17606. --- Fix Version/s: 3.5.0 Resolution: Fixed The pull request is now merged. > Do not require implementing CustomizedCallbackHandler > - > > Key: HDFS-17606 > URL: https://issues.apache.org/jira/browse/HDFS-17606 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > HDFS-17576 added a CustomizedCallbackHandler interface which declares the > following method: > {code} > void handleCallback(List<Callback> callbacks, String name, char[] password) > throws UnsupportedCallbackException, IOException; > {code} > This Jira is to allow an implementation to define the handleCallback method > without implementing the CustomizedCallbackHandler interface. It is to avoid > a security provider depending on the HDFS project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
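One way to meet the HDFS-17606 goal (a provider supplies the method shape without compiling against the interface) is a reflective adapter. The sketch below is hypothetical -- the class and method names `ReflectiveCallbackAdapter` and `MyProvider` are illustrative, not the actual Hadoop implementation:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

public class ReflectiveCallbackAdapter {
  private final Object delegate;
  private final Method handle;

  // Look up a method with the same shape as the interface declares,
  // without requiring the delegate to implement the interface.
  public ReflectiveCallbackAdapter(Object delegate) throws NoSuchMethodException {
    this.delegate = delegate;
    this.handle = delegate.getClass()
        .getMethod("handleCallback", List.class, String.class, char[].class);
  }

  public void handleCallback(List<?> callbacks, String name, char[] password)
      throws Exception {
    handle.invoke(delegate, callbacks, name, password);
  }

  // A provider with the right method shape but no interface dependency.
  static class MyProvider {
    public void handleCallback(List<?> callbacks, String name, char[] password) {
      System.out.println("handled " + callbacks.size() + " callback(s) for " + name);
    }
  }

  public static void main(String[] args) throws Exception {
    ReflectiveCallbackAdapter a = new ReflectiveCallbackAdapter(new MyProvider());
    a.handleCallback(Arrays.asList("cb"), "user1", "pw".toCharArray());
  }
}
```

The adapter keeps the reflection cost to a one-time method lookup; per-call overhead is a single `Method.invoke`.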
[jira] [Assigned] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reassigned RATIS-2137: - Assignee: Kevin Liu > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Assignee: Kevin Liu >Priority: Major > Attachments: image-2024-08-13-11-28-16-250.png > > Time Spent: 20m > Remaining Estimate: 0h > > *I found that after the following log, the follower became unavailable. The > follower received incorrect entries repeatedly for about 10min, then > installSnapshot failed and it started an election. After two hours, it > succeeded in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it > repeated 'receive installSnapshot and installSnapshot failed' for several > hours until I restarted the server.* > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > *(repeat 'Failed appendEntries')* > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0 > 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An > exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. 
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to > CANDIDATE at term 59 for changeToCandidate > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] > RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default) > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > start 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at > term 59 for PRE_VOTE > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit > vote requests at term 59 for 34233595: > peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], > old=null > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE
[jira] [Updated] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-2137: -- Component/s: server > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Assignee: Kevin Liu >Priority: Major > Attachments: image-2024-08-13-11-28-16-250.png > > Time Spent: 20m > Remaining Estimate: 0h
[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874915#comment-17874915 ] Tsz-wo Sze commented on RATIS-2137: --- [~lemony], it would be great if you could submit a pull request. Thank you in advance! > Leader fails to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Priority: Major > Attachments: image-2024-08-13-11-28-16-250.png
[jira] [Updated] (HDDS-11331) Fix Datanode unable to report for a long time
[ https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-11331: -- Attachment: 7090_review.patch > Fix Datanode unable to report for a long time > - > > Key: HDDS-11331 > URL: https://issues.apache.org/jira/browse/HDDS-11331 > Project: Apache Ozone > Issue Type: Improvement > Components: DN >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Attachments: 1505js.1, 7090_review.patch, screenshot-1.png, > screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png > > > SCM shows that some Datanodes cannot report for a long time, and their status > is DEAD or STALE. > I printed jstack information, which shows that StateContext#pipelineActions > is stuck and cannot report to SCM/Recon. > The jstack information has been uploaded as an attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Created] (HDFS-17606) Do not require implementing CustomizedCallbackHandler
Tsz-wo Sze created HDFS-17606: - Summary: Do not require implementing CustomizedCallbackHandler Key: HDFS-17606 URL: https://issues.apache.org/jira/browse/HDFS-17606 Project: Hadoop HDFS Issue Type: Improvement Components: security Reporter: Tsz-wo Sze Assignee: Tsz-wo Sze HDFS-17576 added a CustomizedCallbackHandler interface which declares the following method: {code} void handleCallback(List<Callback> callbacks, String name, char[] password) throws UnsupportedCallbackException, IOException; {code} This Jira is to allow an implementation to define the handleCallback method without implementing the CustomizedCallbackHandler interface. It is to avoid a security provider depending on the HDFS project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception
[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874531#comment-17874531 ] Tsz-wo Sze commented on RATIS-2137: --- [~lemony], you are right that LogAppenderDefault incorrectly handles the INCONSISTENCY case. It should use the getNextIndexForInconsistency(..) method in LogAppenderBase and then setNextIndex(..). {code} diff --git a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java index 432a4199..f75a80f8 100644 --- a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java +++ b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java @@ -23,6 +23,7 @@ import org.apache.ratis.proto.RaftProtos.InstallSnapshotReplyProto; import org.apache.ratis.proto.RaftProtos.InstallSnapshotRequestProto; import org.apache.ratis.rpc.CallId; import org.apache.ratis.server.RaftServer; +import org.apache.ratis.server.raftlog.RaftLog; import org.apache.ratis.server.raftlog.RaftLogIOException; import org.apache.ratis.server.util.ServerStringUtils; import org.apache.ratis.statemachine.SnapshotInfo; @@ -34,6 +35,7 @@ import java.io.InterruptedIOException; import java.util.Comparator; import java.util.UUID; import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicLong; /** * The default implementation of {@link LogAppender} @@ -55,7 +57,7 @@ class LogAppenderDefault extends LogAppenderBase { } /** Send an appendEntries RPC; retry indefinitely. 
*/ - private AppendEntriesReplyProto sendAppendEntriesWithRetries() + private AppendEntriesReplyProto sendAppendEntriesWithRetries(AtomicLong requestFirstIndex) throws InterruptedException, InterruptedIOException, RaftLogIOException { int retry = 0; @@ -78,9 +80,12 @@ class LogAppenderDefault extends LogAppenderBase { return null; } -AppendEntriesReplyProto r = sendAppendEntries(request.get()); +final AppendEntriesRequestProto proto = request.get(); +final AppendEntriesReplyProto reply = sendAppendEntries(proto); +final long first = proto.getEntriesCount() > 0 ? proto.getEntries(0).getIndex() : RaftLog.INVALID_LOG_INDEX; +requestFirstIndex.set(first); request.release(); -return r; +return reply; } catch (InterruptedIOException | RaftLogIOException e) { throw e; } catch (IOException ioe) { @@ -164,9 +169,10 @@ class LogAppenderDefault extends LogAppenderBase { } // otherwise if r is null, retry the snapshot installation } else { - final AppendEntriesReplyProto r = sendAppendEntriesWithRetries(); + final AtomicLong requestFirstIndex = new AtomicLong(RaftLog.INVALID_LOG_INDEX); + final AppendEntriesReplyProto r = sendAppendEntriesWithRetries(requestFirstIndex); if (r != null) { -handleReply(r); +handleReply(r, requestFirstIndex.get()); } } } @@ -177,7 +183,8 @@ class LogAppenderDefault extends LogAppenderBase { } } - private void handleReply(AppendEntriesReplyProto reply) throws IllegalArgumentException { + private void handleReply(AppendEntriesReplyProto reply, long requestFirstIndex) + throws IllegalArgumentException { if (reply != null) { switch (reply.getResult()) { case SUCCESS: @@ -200,7 +207,7 @@ class LogAppenderDefault extends LogAppenderBase { onFollowerTerm(reply.getTerm()); break; case INCONSISTENCY: - getFollower().decreaseNextIndex(reply.getNextIndex()); + getFollower().setNextIndex(getNextIndexForInconsistency(requestFirstIndex, reply.getNextIndex())); break; case UNRECOGNIZED: LOG.warn("{}: received {}", this, reply.getResult()); {code} > Leader fails 
to send correct index to follower after timeout exception > -- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug >Affects Versions: 2.5.1 >Reporter: Kevin Liu >Priority: Major > Attachments: image-2024-08-13-11-28-16-250.png
[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time
[ https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874526#comment-17874526 ] Tsz-wo Sze commented on HDDS-11331: --- Sure, it seems like a good idea! > Datanode cannot report for a long time > -- > > Key: HDDS-11331 > URL: https://issues.apache.org/jira/browse/HDDS-11331 > Project: Apache Ozone > Issue Type: Improvement > Components: DN >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: 1505js.1, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png > > > SCM shows that some Datanodes cannot report for a long time, and their status > is DEAD or STALE. > I printed jstack information, which shows that StateContext#pipelineActions > is stuck and cannot report to SCM/Recon. > The jstack information has been uploaded as an attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time
[ https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874353#comment-17874353 ] Tsz-wo Sze commented on HDDS-11331: --- It seems that this can be fixed by changing pipelineActions to ConcurrentHashMap and Collections.synchronizedMap(LinkedHashMap) {code} private final Map> pipelineActions = new ConcurrentHashMap<>(); {code} {code} static class Key { private final HddsProtos.PipelineID pipelineID; private final PipelineAction.Action action; Key(HddsProtos.PipelineID pipelineID, PipelineAction.Action action) { this.pipelineID = pipelineID; this.action = action; } @Override public int hashCode() { return Objects.hashCode(pipelineID); } @Override public boolean equals(Object obj) { if (this == obj) { return true; } else if (!(obj instanceof Key)) { return false; } final Key that = (Key) obj; return Objects.equals(this.action, that.action) && Objects.equals(this.pipelineID, that.pipelineID); } } {code} > Datanode cannot report for a long time > -- > > Key: HDDS-11331 > URL: https://issues.apache.org/jira/browse/HDDS-11331 > Project: Apache Ozone > Issue Type: Improvement > Components: DN >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: 1505js.1, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png > > > SCM shows that some Datanodes cannot report for a long time, and their status > is DEAD or STALE. > I printed jstack information, which shows that StateContext#pipelineActions > is stuck and cannot report to SCM/Recon. > The jstack information has been uploaded as an attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
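The proposed Key can be sanity-checked in isolation. The sketch below is a simplified stand-in (String fields instead of the HddsProtos.PipelineID and PipelineAction.Action protobuf types): its hashCode deliberately uses only the pipeline ID, so all actions for one pipeline land in the same hash bucket, while equals still distinguishes actions, which keeps deduplication per (pipeline, action).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyDemo {
  static final class Key {
    final String pipelineID;
    final String action;

    Key(String pipelineID, String action) {
      this.pipelineID = pipelineID;
      this.action = action;
    }

    // Hash on the pipeline ID only; still valid, since equal keys
    // (same pipelineID and action) always get equal hash codes.
    @Override public int hashCode() {
      return Objects.hashCode(pipelineID);
    }

    @Override public boolean equals(Object obj) {
      if (this == obj) {
        return true;
      } else if (!(obj instanceof Key)) {
        return false;
      }
      final Key that = (Key) obj;
      return Objects.equals(this.action, that.action)
          && Objects.equals(this.pipelineID, that.pipelineID);
    }
  }

  public static void main(String[] args) {
    Map<Key, String> actions = new HashMap<>();
    actions.put(new Key("pipeline-1", "CLOSE"), "a");
    actions.put(new Key("pipeline-1", "CLOSE"), "b");     // equal key: deduplicated
    actions.put(new Key("pipeline-1", "UNHEALTHY"), "c"); // same hash, distinct key
    System.out.println(actions.size());
  }
}
```

With the two distinct (pipeline, action) pairs above, the map ends up with exactly two entries even though all three keys share one hash bucket.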
[jira] [Commented] (HDDS-11291) Datanode Command Handler blocked by executing ratis requests
[ https://issues.apache.org/jira/browse/HDDS-11291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874345#comment-17874345 ] Tsz-wo Sze commented on HDDS-11291: --- This problem may be similar to HDDS-11331. > Datanode Command Handler blocked by executing ratis requests > > > Key: HDDS-11291 > URL: https://issues.apache.org/jira/browse/HDDS-11291 > Project: Apache Ozone > Issue Type: Bug >Reporter: Janus Chow >Assignee: Janus Chow >Priority: Major > Labels: pull-request-available > > We met the following issue: Datanode command handler executing close > container request, but the timeout logic is not correct, so it blocks all > requests from SCM. > The jstack shows as follows: > {code:java} > "Command processor thread" #215 daemon prio=5 os_prio=0 > tid=0x7fcef3262000 nid=0xa56 waiting on condition [0x7fcf63f9d000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7fd4ab6dcd38> (a > java.util.concurrent.CompletableFuture$Signaller) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) > at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) > at > java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947) > at > org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436) > at > org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995) > at > java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137) > at > 
org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611) > at > org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105) > at > org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593) > at > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {code} > The direct reason is the timeout logic is not working, because in Ratis the > executeSubmitClientRequestAsync is a join() operation, and it will block the > timeout on the outer CompletableFuture. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org
[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time
[ https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874293#comment-17874293 ] Tsz-wo Sze commented on HDDS-11331: --- [~jianghuazhu], why the thread was stuck in calculatePipelineBytesWritten()? > Datanode cannot report for a long time > -- > > Key: HDDS-11331 > URL: https://issues.apache.org/jira/browse/HDDS-11331 > Project: Apache Ozone > Issue Type: Improvement > Components: DN >Affects Versions: 1.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: 1505js.1 > > > SCM shows that some Datanodes cannot report for a long time, and their status > is DEAD or STALE. > I printed jstack information, which shows that StateContext#pipelineActions > is stuck and cannot report to SCM/Recon. > The jstack information has been uploaded as an attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org