[jira] [Commented] (HADOOP-19281) MetricsSystemImpl should not print INFO message in CLI

2024-09-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883313#comment-17883313
 ] 

Tsz-wo Sze commented on HADOOP-19281:
-

[~sarvekshayr], thanks for checking.  Could you try "hadoop fs" such as the 
command below?
{code}
hadoop fs  -Dfs.s3a.bucket.probe=0 
-Dfs.s3a.change.detection.version.required=false 
-Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 
-Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 
-Dfs.s3a.endpoint=http://some.site:9878  -Dfs.s3a.path.style.access=true 
-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   -ls  -R s3a://bucket1/
{code}
You are right that this is a problem in Hadoop but not Ozone.  Moved this to 
Hadoop Common.

> MetricsSystemImpl should not print INFO message in CLI
> --
>
> Key: HADOOP-19281
> URL: https://issues.apache.org/jira/browse/HADOOP-19281
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Tsz-wo Sze
>Priority: Major
>  Labels: newbie
>
> Below is an example:
> {code}
> # hadoop fs  -Dfs.s3a.bucket.probe=0 
> -Dfs.s3a.change.detection.version.required=false 
> -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 
> -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 
> -Dfs.s3a.endpoint=http://some.site:9878  -Dfs.s3a.path.style.access=true 
> -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   -ls  -R s3a://bucket1/
> 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
> period at 10 second(s).
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> started
> 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option 
> fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 
> ms instead
> 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an 
> instance of MultipartS3AsyncClient, and thus multipart download feature is 
> not enabled. To benefit from all features, consider using 
> S3AsyncClient.crtBuilder().build() instead.
> drwxrwxrwx   - root root  0 2024-09-17 10:47 s3a://bucket1/dir1
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> stopped.
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> shutdown complete. 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Assigned] (HADOOP-19281) MetricsSystemImpl should not print INFO message in CLI

2024-09-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned HADOOP-19281:
---

Component/s: metrics
 (was: metrics)
Key: HADOOP-19281  (was: HDDS-11466)
   Workflow: no-reopen-closed, patch-avail  (was: patch-available, re-open 
possible)
   Assignee: (was: Sarveksha Yeshavantha Raju)
Project: Hadoop Common  (was: Apache Ozone)

> MetricsSystemImpl should not print INFO message in CLI
> --
>
> Key: HADOOP-19281
> URL: https://issues.apache.org/jira/browse/HADOOP-19281
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Tsz-wo Sze
>Priority: Major
>  Labels: newbie
>
> Below is an example:
> {code}
> # hadoop fs  -Dfs.s3a.bucket.probe=0 
> -Dfs.s3a.change.detection.version.required=false 
> -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 
> -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 
> -Dfs.s3a.endpoint=http://some.site:9878  -Dfs.s3a.path.style.access=true 
> -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   -ls  -R s3a://bucket1/
> 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
> period at 10 second(s).
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> started
> 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option 
> fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 
> ms instead
> 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an 
> instance of MultipartS3AsyncClient, and thus multipart download feature is 
> not enabled. To benefit from all features, consider using 
> S3AsyncClient.crtBuilder().build() instead.
> drwxrwxrwx   - root root  0 2024-09-17 10:47 s3a://bucket1/dir1
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> stopped.
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> shutdown complete. 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (RATIS-2160) MetricRegistriesLoader should not print INFO message in CLI

2024-09-19 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-2160:
-

 Summary: MetricRegistriesLoader should not print INFO message in 
CLI
 Key: RATIS-2160
 URL: https://issues.apache.org/jira/browse/RATIS-2160
 Project: Ratis
  Issue Type: Bug
  Components: shell
Reporter: Tsz-wo Sze


MetricRegistriesLoader uses MetricRegistries log to print the following INFO 
message.
{code}
2024-09-19 10:56:43 INFO  MetricRegistries:64 - Loaded MetricRegistries class 
org.apache.ratis.metrics.impl.MetricRegistriesImpl
{code}
Note that "MetricRegistries:64" is very misleading since 64 actually is the 
line 64 in MetricRegistriesLoader, not MetricRegistries.
{code}
//line 64 in MetricRegistriesLoader
  LOG.info("Loaded MetricRegistries " + impl.getClass());
{code}
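Below is a minimal sketch of how such a misleading prefix can arise with an 
slf4j/log4j-style setup (the class shapes here are assumptions for illustration, 
not the actual Ratis source): the logger is named after MetricRegistries, while 
the line number reported by the layout is the call site in MetricRegistriesLoader.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class MetricRegistries {
  // The logger is named after MetricRegistries ...
  static final Logger LOG = LoggerFactory.getLogger(MetricRegistries.class);
}

class MetricRegistriesLoader {
  static void load(Object impl) {
    // ... but this call site is in MetricRegistriesLoader, so a layout printing
    // the logger name and line number (e.g. %c{1}:%L) shows
    // "MetricRegistries:<line number in MetricRegistriesLoader>".
    MetricRegistries.LOG.info("Loaded MetricRegistries " + impl.getClass());
  }
}
{code}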

When there are multiple implementations, it will print the following WARN 
message instead.
{code}
[main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple 
MetricRegistries implementations: class 
org.apache.ratis.metrics.impl.MetricRegistriesImpl, class 
org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found 
implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@4d1b0d2a
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed

2024-09-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883060#comment-17883060
 ] 

Tsz-wo Sze commented on HDDS-11470:
---

Below is an example where om2 replied "Completed INSTALL_SNAPSHOT" even though it 
had actually failed to move the downloaded DB checkpoint due to a UnixException 
"Invalid cross-device link".
{code}
2024-09-18 09:28:20,365 INFO 
[grpc-default-executor-1]-org.apache.ratis.grpc.server.GrpcServerProtocolService:
 om2: Completed INSTALL_SNAPSHOT, lastReply: null
2024-09-18 09:28:20,365 INFO 
[pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: metadataManager is 
stopped. Spend 7 ms.
2024-09-18 09:28:20,367 ERROR 
[pool-33-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Failed to move 
downloaded DB checkpoint /var/lib/hadoop-ozone/om/ozone-metaot/om.db.candidate 
to metadata directory /ozone/hadoop-ozone/om/data/om.db. Exception: {}. 
Resetting to original DB.
java.nio.file.FileSystemException: /ozone/hadoop-ozone/om/data/om.db/44.sst 
-> /var/lib/hadoop-ozone/om/ozone-metadata/snapshot/om.db.candidate/44.sst: 
Invalid cross-device link
at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at 
java.base/sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:477)
at java.base/java.nio.file.Files.createLink(Files.java:1101)
at 
org.apache.hadoop.ozone.om.snapshot.OmSnapshotUtils.linkFiles(OmSnapshotUtils.java:169)
at 
org.apache.hadoop.ozone.om.OzoneManager.moveCheckpointFiles(OzoneManager.java:3884)
at 
org.apache.hadoop.ozone.om.OzoneManager.replaceOMDBWithCheckpoint(OzoneManager.java:3864)
at 
org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3738)
at 
org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3673)
at 
org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3650)
at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$5(OzoneManagerStateMachine.java:505)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
{code}


> OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
> 
>
> Key: HDDS-11470
> URL: https://issues.apache.org/jira/browse/HDDS-11470
> Project: Apache Ozone
>  Issue Type: Bug
>  Components: OM HA
>Reporter: Tsz-wo Sze
>Priority: Major
>
> When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply 
> "Completed INSTALL_SNAPSHOT".
> In the code below, when there is an exception, it just prints an error message 
> and continues to reply "Completed INSTALL_SNAPSHOT".
> {code}
> //OzoneManager.installCheckpoint
>   try {
> time = Time.monotonicNow();
> dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex,
> oldDBLocation, checkpointLocation);
> term = checkpointTrxnInfo.getTerm();
> lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
> LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " +
> "index: {}, time: {} ms", leaderId, term, lastAppliedIndex,
> Time.monotonicNow() - time);
>   } catch (Exception e) {
> LOG.error("Failed to install Snapshot from {} as OM failed to 
> replace" +
> " DB with downloaded checkpoint. Reloading old OM state.",
> leaderId, e);
>   }
> {code}
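Below is a minimal sketch of the suggested behavior (the helper names are 
hypothetical, not the actual Ozone patch): propagate the failure instead of only 
logging it, so the caller can report an error rather than replying 
"Completed INSTALL_SNAPSHOT".
{code:java}
import java.io.IOException;

// Hypothetical sketch: the replace step is abstracted behind an interface so the
// example is self-contained; the real OzoneManager method takes more parameters.
final class CheckpointInstallSketch {
  interface DbReplacer {
    void replaceOmDbWithCheckpoint() throws Exception;
  }

  void installCheckpoint(DbReplacer replacer, String leaderId) throws IOException {
    try {
      replacer.replaceOmDbWithCheckpoint();
    } catch (Exception e) {
      // Previously the exception was only logged and execution continued;
      // rethrowing lets the snapshot-installation caller fail the request
      // instead of replying "Completed INSTALL_SNAPSHOT".
      throw new IOException("Failed to replace OM DB with checkpoint from " + leaderId, e);
    }
  }
}
{code}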



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Updated] (RATIS-2145) Follower hangs until the next trigger to take a snapshot

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2145:
--
Fix Version/s: (was: 3.2.0)

>  Follower hangs until the next trigger to take a snapshot
> -
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We discovered a problem when writing tests with high concurrency. It often 
> happens that a follower is running well and then triggers takeSnapshot.
> The following is the relevant log.
> follower: (as the follower log says, between 2024/08/22 20:18:14,044 and 
> 2024/08/22 20:21:57,058, no other logs appeared in the follower, but a 
> follower election was not triggered, indicating that the heartbeats sent by 
> the leader to the follower were successful)
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO 
> org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment 
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax 
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO 
> com.xxx.RaftJournalManager: Received install snapshot notification from 
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO 
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from 
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, 
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/22 20:21:57,067 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed 
> appendEntries as snapshot (22441477801) installation is in progress
> 2024/08/22 20:21:57,068 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code}
> leader:
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, 
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, 
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, 
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, 
> i:22441098623)
> 2024/08/22 20:18:16,965 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192, 
> i:22441098627)
> 2024/08/22 20:18:16,965 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=168

[jira] [Updated] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2146:
--
Fix Version/s: (was: 3.2.0)

> Fixed possible issues caused by concurrent deletion and election when member 
> changes
> 
>
> Key: RATIS-2146
> URL: https://issues.apache.org/jira/browse/RATIS-2146
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-08-28-14-53-23-259.png, 
> image-2024-08-28-14-53-27-637.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> During this process, we encountered some concurrency issues:
> * After the member change is complete, node D will no longer be a member of 
> this consensus group. It will attempt to initiate an election but receive a 
> NOT_IN_CONF response, after which it will close itself.
> * During the removal of member D, it will also close itself first, and then 
> proceed to delete the file directory.
> These two CLOSE operations may occur concurrently, which could result in the 
> directory being deleted while the StateMachineUpdater thread has not yet 
> closed, ultimately leading to unexpected errors.
>  !image-2024-08-28-14-53-23-259.png! 
>  !image-2024-08-28-14-53-27-637.png! 
> I believe there are two possible solutions for this issue:
> * Add concurrency control to the close function, such as adding the 
> synchronized keyword to the function.
> * Add some checks before deleting the directory to ensure that the callback 
> functions in the close process have already been executed before the 
> directory is deleted.
> What's your opinion? [~szetszwo]
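A minimal sketch of the first option (hypothetical names, not the actual Ratis 
change): serializing close() and the directory removal so the StateMachineUpdater 
is fully stopped before any files are deleted.
{code:java}
// Hypothetical sketch; the method bodies are placeholders.
final class DivisionCloseSketch {
  private boolean closed;

  synchronized void close() {
    if (closed) {
      return;                       // a second concurrent close becomes a no-op
    }
    closed = true;
    stopStateMachineUpdater();      // completes before anyone may delete the directory
  }

  synchronized void deleteGroupDirectory() {
    close();                        // ensures the close callbacks have already run
    removeRaftStorageDirectory();
  }

  private void stopStateMachineUpdater() { /* stop and join the updater thread */ }

  private void removeRaftStorageDirectory() { /* delete the group's storage directory */ }
}
{code}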



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2153) ratis-version.properties missing from src bundle

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2153:
--
Fix Version/s: (was: 3.2.0)

> ratis-version.properties missing from src bundle
> 
>
> Key: RATIS-2153
> URL: https://issues.apache.org/jira/browse/RATIS-2153
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Attila Doroszlai
>Assignee: Attila Doroszlai
>Priority: Blocker
> Fix For: 3.1.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> RATIS-1840 added {{src/main/resources/ratis-version.properties}} in root 
> module.  This file is missing from the {{src}} assembly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2149:
--
Fix Version/s: 3.1.1
   (was: 3.2.0)

> Do not perform leader election if the current RaftServer has not started yet
> 
>
> Key: RATIS-2149
> URL: https://issues.apache.org/jira/browse/RATIS-2149
> Project: Ratis
>  Issue Type: Improvement
>  Components: election
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-03-17-41-41-872.png, 
> image-2024-09-03-18-13-50-628.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Sometimes we cannot guarantee that the program will run normally in various 
> environments, and appropriate robustness enhancement may be necessary.
> Before adding members, RaftServer S and the corresponding group will be 
> created if the group does not exist, and we found that the interval between 
> these two logs is more than one minute.
> !image-2024-09-03-17-41-41-872.png!
>  
> Since our RpcTimeout is only 1 minute, the retryPolicy has already started, 
> but S's groupId is already in the implMaps of RaftServerProxy, which will 
> throw AlreadyExistException. When we catch this exception, we assume that the 
> creation has been completed and the member change can be executed.
>  
> S is still in the initializing state, so this member change will not be 
> completed. Finally, we found that S started an election, received a 
> NOT_IN_CONF reply, and was then closed.
> !image-2024-09-03-18-13-50-628.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2148) Snapshot transfer may cause followers to trigger reloadStateMachine incorrectly

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2148:
--
Fix Version/s: 3.1.1
   (was: 3.2.0)

> Snapshot transfer may cause followers to trigger reloadStateMachine 
> incorrectly
> ---
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-03-14-24-25-652.png, 
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Because grpc streaming snapshot sending sends all requests at once, error 
> handling is performed only after all of them are sent, and the last snapshot 
> request is used as a completion flag. This may lead to the last request being 
> received successfully even though an earlier request has failed. The sender 
> handles the failure by retransmitting the snapshot, while the receiver 
> triggers state.reloadStateMachine because it successfully received the last 
> request, even though the snapshot reception is incomplete.
>  
> An md5 mismatch exception occurred before the last SnapshotRequest was 
> received
> !image-2024-09-03-14-27-39-406.png!
>  
> The last snapshot request arrived and was successfully received, and the 
> index was then updated.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>  
> However, the snapshot reception is incomplete and triggers the 
> reloadStateMachine.
> !image-2024-09-03-14-33-49-573.png!
>  
> I suggest using a flag to identify whether the entire snapshot request stream 
> is abnormal.
> If an exception occurs, the subsequent content of the request will not be 
> processed.
> Or the sender could wait for the receiver's reply; if there is an error, 
> resend it.
>  
> Finally, the current error retry level is the entire snapshot directory 
> rather than a single chunk, which will cause a large number of snapshot files 
> to be sent repeatedly; this can be optimized later.
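A minimal sketch of the suggested flag (hypothetical names, not the Ratis 
implementation): once any chunk of the stream fails, later chunks, including the 
final one, are ignored, so reloadStateMachine is not triggered on an incomplete 
snapshot.
{code:java}
// Hypothetical receiver-side sketch: keep a failure flag per snapshot stream.
final class SnapshotStreamStateSketch {
  private boolean failed;

  /** Returns true if the chunk was applied; false once the stream is marked failed. */
  boolean onChunk(boolean digestMatches, boolean isLast) {
    if (failed) {
      return false;                 // ignore everything after the first failure, even the last chunk
    }
    if (!digestMatches) {
      failed = true;                // e.g. an md5 mismatch on this chunk
      return false;
    }
    if (isLast) {
      reloadStateMachine();         // only reached when every earlier chunk succeeded
    }
    return true;
  }

  private void reloadStateMachine() { /* placeholder for state.reloadStateMachine() */ }
}
{code}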



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2154) The old leader may send appendEntries after term changed

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2154:
--
Fix Version/s: 3.1.1
   (was: 3.2.0)

> The old leader may send appendEntries after term changed
> 
>
> Key: RATIS-2154
> URL: https://issues.apache.org/jira/browse/RATIS-2154
> Project: Ratis
>  Issue Type: Wish
>  Components: Leader
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-12-09-43-30-670.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The leader will become a follower after receiving a higher term, but during 
> this process, the old leader may be appending LogEntry, and the error log 
> will be printed until LogAppenderDaemon is closed.
> !image-2024-09-12-09-43-30-670.png!
>  
> I think we can put state.updateCurrentTerm (newTerm) later. Close LeaderState 
> first before updating the term, and other operations remain unchanged.
>  
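A minimal sketch of the suggested ordering (hypothetical names, not the Ratis 
source): close the leader state, and with it the LogAppenderDaemon threads, 
before the term is updated, so the old leader no longer appends entries under the 
stale term.
{code:java}
// Hypothetical step-down sketch illustrating only the reordering.
final class StepDownSketch {
  interface LeaderState {
    void close();                     // stops the LogAppenderDaemon threads
  }

  private long currentTerm;
  private LeaderState leaderState;    // null when this server is not the leader

  void onHigherTerm(long newTerm) {
    if (leaderState != null) {
      leaderState.close();            // 1) stop appending as leader first
      leaderState = null;
    }
    currentTerm = newTerm;            // 2) only then update the term (state.updateCurrentTerm)
  }

  long getCurrentTerm() {
    return currentTerm;
  }
}
{code}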



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2150) No need for manual assembly:single execution when mvn deploy

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2150:
--
Fix Version/s: 3.1.1

> No need for manual assembly:single execution when mvn deploy
> 
>
> Key: RATIS-2150
> URL: https://issues.apache.org/jira/browse/RATIS-2150
> Project: Ratis
>  Issue Type: Improvement
>  Components: build
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This [RATIS-2117|https://issues.apache.org/jira/browse/RATIS-2117] ignores 
> the mvn deploy command update, which will be addressed in this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2152) GrpcLogAppender gets stuck while sending an installSnapshot notification request

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2152:
--
Fix Version/s: 3.1.1
   (was: 3.2.0)

> GrpcLogAppender gets stuck while sending an installSnapshot notification request
> 
>
> Key: RATIS-2152
> URL: https://issues.apache.org/jira/browse/RATIS-2152
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Chung En Lee
>Assignee: Chung En Lee
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In `GrpcLogAppender`, it waits for a signal at the end of 
> `notifyInstallSnapshot` as follows.
> [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831]
> However, the check of whether the `InstallSnapshotResponseHandler` is done and 
> the call to `AwaitForSignal.await()` are not atomic. This creates a potential 
> race condition where InstallSnapshotResponseHandler.close() could finish 
> after the check but before the wait, so that `GrpcLogAppender` may still be 
> waiting even though `InstallSnapshotResponseHandler` has already completed, 
> leading to a timeout.
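A minimal illustration of the race and one possible remedy (assumed names, not 
the actual GrpcLogAppender code): because the done-check and the wait are 
separate steps, a signal fired between them can be missed; awaiting with a 
timeout and re-checking the condition avoids waiting forever.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a bounded, re-checking wait.
final class AwaitRaceSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition signal = lock.newCondition();
  private boolean done;

  void markDone() {                   // plays the role of InstallSnapshotResponseHandler.close()
    lock.lock();
    try {
      done = true;
      signal.signalAll();
    } finally {
      lock.unlock();
    }
  }

  void awaitDone() throws InterruptedException {
    lock.lock();
    try {
      while (!done) {
        // Bounded wait: even if the signal fired before we reached await(),
        // the loop re-checks "done" after the timeout instead of blocking forever.
        signal.await(100, TimeUnit.MILLISECONDS);
      }
    } finally {
      lock.unlock();
    }
  }
}
{code}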



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2080) Reuse LeaderElection executor

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2080.
---
Fix Version/s: (was: 3.1.0)
   Resolution: Won't Do

Resolving this as "Won't Do".

> Reuse LeaderElection executor
> -
>
> Key: RATIS-2080
> URL: https://issues.apache.org/jira/browse/RATIS-2080
> Project: Ratis
>  Issue Type: Improvement
>  Components: election
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When running TestRaftWithNetty#testWithLoad with 5 servers, there were 110 
> leader election threads as shown below.  We should reuse the vote executor.
> {code}
> $cat threaddump.txt | grep "C-LeaderElection" | wc
>     110    1320   25621
> {code}
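A minimal sketch of the reuse idea (hypothetical names, not the Ratis change): 
keep one executor per server and submit each election's vote requests to it, 
instead of creating a fresh executor, and thus new threads, for every 
LeaderElection.
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one shared executor reused across elections.
final class VoteExecutorSketch {
  private final ExecutorService voteExecutor = Executors.newCachedThreadPool();

  Future<Boolean> submitVoteRequest(Callable<Boolean> requestVoteToPeer) {
    // Each LeaderElection submits here rather than spawning its own executor,
    // so idle vote threads are reused instead of accumulating.
    return voteExecutor.submit(requestVoteToPeer);
  }

  void close() {
    voteExecutor.shutdownNow();
  }
}
{code}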



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2147) MD5 mismatch when accept snapshot

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2147.
---
Fix Version/s: 3.1.1
   Resolution: Fixed

The pull request was merged. Thanks, [~tohsakarin__]!

> MD5 mismatch when accept snapshot
> -
>
> Key: RATIS-2147
> URL: https://issues.apache.org/jira/browse/RATIS-2147
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0, 3.2.0
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> We encountered an MD5 mismatch issue in IoTDB, and after multiple 
> investigations, we found that the digester was contaminated
>  
> We have checked that it is not a network or disk problem.
>  
> In implementation, the received snapshot will be written to a temporary file 
> first. If there is an md5 mismatch, we will read the data from this temporary 
> file and use a new digest to calculate md5, but the result of this 
> calculation is the same as the md5 hash value sent
> !image-2024-09-03-10-35-28-617.png!
>  
> !image-2024-09-03-10-35-08-315.png!
>  
>  
> Use the saved corrupted file name to locate the relevant log; here we take 
> tlog.txt.snapshot.snapshot.corrupt20240831-094107 _735 as an example.
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA!
> Before encountering corrupt, the sender sent several consecutive snapshot 
> installation requests to the receiver.
>  
> The receiver successfully received some requests, then encountered a corrupt 
> request and began printing "recompute again" to start recalculating.
>  
> After execution, the ERROR log of the rename will be printed, and the data 
> will be read from the file and compared with the received chunk data.
>  
> If a byte did not match, the corresponding information would be printed; since 
> no such log was printed, the content written to the disk is the same as the 
> content sent.
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA!
> This makes the problem very clear. There is a problem with the MD5 
> calculation class, and the reasons are as follows:
>  
>      If a byte in the middle of the data part is incorrect due to network 
> reasons, the calculated result and the hash sent must be different
>  
>     If there is a problem with the part that stores the hash value, the final 
> calculation result will also be different.
>  
> I suggest creating a new digest every time the follower receives a snapshot, 
> so as to avoid pollution problems. Under normal network and disk conditions, 
> corruption will not occur.
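A minimal sketch of the suggestion (assumed names, not the actual Ratis patch): 
create a fresh MessageDigest per snapshot transfer so a previously used, 
partially updated digester cannot contaminate the next MD5 computation.
{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SnapshotDigestSketch {
  static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  static byte[] digestChunks(Iterable<byte[]> chunks) {
    MessageDigest md5 = newMd5();     // a new digester for each snapshot, never reused
    for (byte[] chunk : chunks) {
      md5.update(chunk);
    }
    return md5.digest();
  }
}
{code}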



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2147) MD5 mismatch when accept snapshot

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2147:
--
Affects Version/s: (was: 3.1.0)
   (was: 3.2.0)

> MD5 mismatch when accept snapshot
> -
>
> Key: RATIS-2147
> URL: https://issues.apache.org/jira/browse/RATIS-2147
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> We encountered an MD5 mismatch issue in IoTDB, and after multiple 
> investigations, we found that the digester was contaminated
>  
> We have checked that it is not a network or disk problem.
>  
> In implementation, the received snapshot will be written to a temporary file 
> first. If there is an md5 mismatch, we will read the data from this temporary 
> file and use a new digest to calculate md5, but the result of this 
> calculation is the same as the md5 hash value sent
> !image-2024-09-03-10-35-28-617.png!
>  
> !image-2024-09-03-10-35-08-315.png!
>  
>  
> Use the saved corrupted file name to locate the relevant log; here we take 
> tlog.txt.snapshot.snapshot.corrupt20240831-094107 _735 as an example.
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA!
> Before encountering corrupt, the sender sent several consecutive snapshot 
> installation requests to the receiver.
>  
> The receiver successfully received some requests, then encountered a corrupt 
> request and began printing "recompute again" to start recalculating.
>  
> After execution, the ERROR log of the rename will be printed, and the data 
> will be read from the file and compared with the received chunk data.
>  
> If a byte did not match, the corresponding information would be printed; since 
> no such log was printed, the content written to the disk is the same as the 
> content sent.
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA!
> This makes the problem very clear. There is a problem with the MD5 
> calculation class, and the reasons are as follows:
>  
>      If a byte in the middle of the data part is incorrect due to network 
> reasons, the calculated result and the hash sent must be different
>  
>     If there is a problem with the part that stores the hash value, the final 
> calculation result will also be different.
>  
> I suggest creating a new digest every time the follower receives a snapshot, 
> so as to avoid pollution problems. Under normal network and disk conditions, 
> corruption will not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2157) Enhance make_rc.sh for non-first rc at release time

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2157:
--
  Component/s: build
Fix Version/s: 3.1.1
   (was: 3.2.0)

> Enhance make_rc.sh for non-first rc at release time
> ---
>
> Key: RATIS-2157
> URL: https://issues.apache.org/jira/browse/RATIS-2157
> Project: Ratis
>  Issue Type: Improvement
>  Components: build
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:java}
> git commit -a -m "Change version for the version $RATISVERSION $RC
> {code}
> may fail when sending a subsequent RC for the current version because the mvn 
> version update has already been executed. Therefore, there are no changes to 
> commit, so we need to add a --allow-empty option.
> BTW, it may be better to remove -X for publish-mvn as it prints a lot of logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2155) Add a builder for RatisShell

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2155:
--
Fix Version/s: 3.1.1
   (was: 3.2.0)

> Add a builder for RatisShell
> 
>
> Key: RATIS-2155
> URL: https://issues.apache.org/jira/browse/RATIS-2155
> Project: Ratis
>  Issue Type: New Feature
>  Components: shell
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, RatisShell is executed via CLI.  It will use the default 
> RaftProperties and a null Parameters to build a RaftClient.  There is no way 
> to pass TlsConf; as a result, RatisShell cannot access secure clusters.
> This JIRA is to add a builder in order to pass RaftProperties and Parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2158) Let the snapshot sender and receiver use a new digester each time

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2158.
---
Fix Version/s: 3.1.1
   Resolution: Fixed

The pull request is now merged.  Thanks, [~tohsakarin__]!

>  Let the snapshot sender and receiver use a new digester each time
> --
>
> Key: RATIS-2158
> URL: https://issues.apache.org/jira/browse/RATIS-2158
> Project: Ratis
>  Issue Type: Wish
>  Components: server
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.1.1
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This is a follow-up improvement to issue RATIS-2147 MD5 mismatch when accept 
> snapshot - ASF JIRA (apache.org).
> The pr of 2147:  [RATIS-2147. Md5 mismatch when snapshot install by 
> 133tosakarin · Pull Request #1142 · apache/ratis 
> (github.com)|https://github.com/apache/ratis/pull/1142]
>  
> Since snapshot files are not sent frequently and there is not much 
> performance loss when using a new digester each time, in order to be more 
> secure, the snapshot sender and receiver should use a new digester each time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (RATIS-2158) Let the snapshot sender and receiver use a new digester each time

2024-09-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2158:
-

Component/s: server
   Assignee: yuuka

>  Let the snapshot sender and receiver use a new digester each time
> --
>
> Key: RATIS-2158
> URL: https://issues.apache.org/jira/browse/RATIS-2158
> Project: Ratis
>  Issue Type: Wish
>  Components: server
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This is a follow-up improvement to issue RATIS-2147 MD5 mismatch when accept 
> snapshot - ASF JIRA (apache.org).
> The pr of 2147:  [RATIS-2147. Md5 mismatch when snapshot install by 
> 133tosakarin · Pull Request #1142 · apache/ratis 
> (github.com)|https://github.com/apache/ratis/pull/1142]
>  
> Since snapshot files are not sent frequently and there is not much 
> performance loss when using a new digester each time, in order to be more 
> secure, the snapshot sender and receiver should use a new digester each time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed

2024-09-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-11470:
--
Description: 
When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply 
"Completed INSTALL_SNAPSHOT".

In the code below, when there is an exception, it just prints an error message 
and continues to reply "Completed INSTALL_SNAPSHOT".
{code}
//OzoneManager.installCheckpoint
  try {
time = Time.monotonicNow();
dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex,
oldDBLocation, checkpointLocation);
term = checkpointTrxnInfo.getTerm();
lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " +
"index: {}, time: {} ms", leaderId, term, lastAppliedIndex,
Time.monotonicNow() - time);
  } catch (Exception e) {
LOG.error("Failed to install Snapshot from {} as OM failed to replace" +
" DB with downloaded checkpoint. Reloading old OM state.",
leaderId, e);
  }
{code}

  was:
When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply 
"Completed INSTALL_SNAPSHOT".
{code}

{code}


> OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed
> 
>
> Key: HDDS-11470
> URL: https://issues.apache.org/jira/browse/HDDS-11470
> Project: Apache Ozone
>  Issue Type: Bug
>  Components: OM HA
>Reporter: Tsz-wo Sze
>Priority: Major
>
> When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply 
> "Completed INSTALL_SNAPSHOT".
> In the code below, when there is an exception, it just prints an error message 
> and continues to reply "Completed INSTALL_SNAPSHOT".
> {code}
> //OzoneManager.installCheckpoint
>   try {
> time = Time.monotonicNow();
> dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex,
> oldDBLocation, checkpointLocation);
> term = checkpointTrxnInfo.getTerm();
> lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
> LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, " +
> "index: {}, time: {} ms", leaderId, term, lastAppliedIndex,
> Time.monotonicNow() - time);
>   } catch (Exception e) {
> LOG.error("Failed to install Snapshot from {} as OM failed to 
> replace" +
> " DB with downloaded checkpoint. Reloading old OM state.",
> leaderId, e);
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Created] (HDDS-11470) OM should not reply Completed INSTALL_SNAPSHOT when installCheckpoint failed

2024-09-18 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDDS-11470:
-

 Summary: OM should not reply Completed INSTALL_SNAPSHOT when 
installCheckpoint failed
 Key: HDDS-11470
 URL: https://issues.apache.org/jira/browse/HDDS-11470
 Project: Apache Ozone
  Issue Type: Bug
  Components: OM HA
Reporter: Tsz-wo Sze


When OM failed to installCheckpoint (e.g. HDDS-10300), it should not reply 
"Completed INSTALL_SNAPSHOT".
{code}

{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (HDDS-11466) MetricsSystemImpl should not print INFO message in CLI

2024-09-18 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882737#comment-17882737
 ] 

Tsz-wo Sze commented on HDDS-11466:
---

We may simply change the INFO message to DEBUG.

bq. 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: 
tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

For the WARN message above, is it really needed?  For CLI, we should minimize 
the unrelated messages.
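A minimal sketch of that direction (illustrative only, not a Hadoop patch; the 
real messages live in org.apache.hadoop.metrics2.impl.MetricsSystemImpl): the 
start/stop notifications move to DEBUG so short-lived CLI runs stay quiet, while 
they can still be enabled for troubleshooting.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative stand-in class, not the actual MetricsSystemImpl.
final class MetricsSystemLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(MetricsSystemLoggingSketch.class);

  void start(String prefix) {
    // was: LOG.info(...) -- shows up in every CLI invocation
    LOG.debug("{} metrics system started", prefix);
  }

  void stop(String prefix) {
    LOG.debug("Stopping {} metrics system...", prefix);
    LOG.debug("{} metrics system stopped.", prefix);
  }
}
{code}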

> MetricsSystemImpl should not print INFO message in CLI
> --
>
> Key: HDDS-11466
> URL: https://issues.apache.org/jira/browse/HDDS-11466
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Tsz-wo Sze
>Assignee: Sarveksha Yeshavantha Raju
>Priority: Major
>  Labels: newbie
>
> Below is an example:
> {code}
> # hadoop fs  -Dfs.s3a.bucket.probe=0 
> -Dfs.s3a.change.detection.version.required=false 
> -Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 
> -Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 
> -Dfs.s3a.endpoint=http://some.site:9878  -Dfs.s3a.path.style.access=true 
> -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   -ls  -R s3a://bucket1/
> 24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
> period at 10 second(s).
> 24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> started
> 24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option 
> fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 
> ms instead
> 24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an 
> instance of MultipartS3AsyncClient, and thus multipart download feature is 
> not enabled. To benefit from all features, consider using 
> S3AsyncClient.crtBuilder().build() instead.
> drwxrwxrwx   - root root  0 2024-09-17 10:47 s3a://bucket1/dir1
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> stopped.
> 24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
> shutdown complete. 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Resolved] (RATIS-2155) Add a builder for RatisShell

2024-09-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2155.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.

> Add a builder for RatisShell
> 
>
> Key: RATIS-2155
> URL: https://issues.apache.org/jira/browse/RATIS-2155
> Project: Ratis
>  Issue Type: New Feature
>  Components: shell
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, RatisShell is executed via CLI.  It will use the default 
> RaftProperties and a null Parameters to build a RaftClient.  There is no way 
> to pass TlsConf; as a result, RatisShell cannot access secure clusters.
> This JIRA is to add a builder in order to pass RaftProperties and Parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HDDS-11466) MetricsSystemImpl should not print INFO message in CLI

2024-09-17 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDDS-11466:
-

 Summary: MetricsSystemImpl should not print INFO message in CLI
 Key: HDDS-11466
 URL: https://issues.apache.org/jira/browse/HDDS-11466
 Project: Apache Ozone
  Issue Type: Improvement
  Components: metrics
Reporter: Tsz-wo Sze


Below is an example:
{code}
# hadoop fs  -Dfs.s3a.bucket.probe=0 
-Dfs.s3a.change.detection.version.required=false 
-Dfs.s3a.change.detection.mode=none -Dfs.s3a.endpoint=http://some.site:9878 
-Dfs.s3a.access.keysome=systest -Dfs.s3a.secret.key=8...1 
-Dfs.s3a.endpoint=http://some.site:9878  -Dfs.s3a.path.style.access=true 
-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   -ls  -R s3a://bucket1/
24/09/17 10:47:48 WARN impl.MetricsConfig: Cannot locate configuration: tried 
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/09/17 10:47:48 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period 
at 10 second(s).
24/09/17 10:47:48 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
started
24/09/17 10:47:48 WARN impl.ConfigurationHelper: Option 
fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 ms 
instead
24/09/17 10:47:50 WARN s3.S3TransferManager: The provided S3AsyncClient is an 
instance of MultipartS3AsyncClient, and thus multipart download feature is not 
enabled. To benefit from all features, consider using 
S3AsyncClient.crtBuilder().build() instead.
drwxrwxrwx   - root root  0 2024-09-17 10:47 s3a://bucket1/dir1
24/09/17 10:47:53 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics 
system...
24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
stopped.
24/09/17 10:47:53 INFO impl.MetricsSystemImpl: s3a-file-system metrics system 
shutdown complete. 
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (RATIS-2156) Notify follower slowness based on the log index

2024-09-16 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882101#comment-17882101
 ] 

Tsz-wo Sze commented on RATIS-2156:
---

Agree.  The "StatusRuntimeException: CANCELLED: RST_STREAM closed stream. 
HTTP/2 error code: CANCEL" message seems to be caused by RATIS-2135. 

> Notify follower slowness based on the log index
> ---
>
> Key: RATIS-2156
> URL: https://issues.apache.org/jira/browse/RATIS-2156
> Project: Ratis
>  Issue Type: Improvement
>  Components: Leader
>Reporter: Ivan Andika
>Assignee: Ivan Andika
>Priority: Major
> Attachments: image-2024-09-13-18-54-04-203.png
>
>
> Currently, StateMachine.LeaderEventApi#notifyFollowerSlowness is based on 
> raft.server.rpc.slowness.timeout. We saw that there are sometimes cases 
> where the rpc rtt between the leader and follower does not exceed the 
> timeout, yet the difference of the log index between the leader and follower 
> keeps increasing, i.e. the slow follower cannot catch up.
> In Ozone, this causes most watch requests with ALL_COMMITTED replication to 
> time out, causing increased latency of writes. It is better to close the 
> pipeline if the slow follower cannot catch up.
> !image-2024-09-13-18-54-04-203.png|width=1408,height=244!
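A minimal sketch of the index-based check (hypothetical names and threshold, not 
the Ratis API): compare the leader's latest log index with the follower's 
matchIndex and report slowness once the gap exceeds a limit, independently of the 
rpc rtt.
{code:java}
// Hypothetical sketch: detect slowness from the log-index gap rather than an rpc timeout.
final class FollowerSlownessSketch {
  private final long maxAllowedIndexGap;

  FollowerSlownessSketch(long maxAllowedIndexGap) {
    this.maxAllowedIndexGap = maxAllowedIndexGap;
  }

  /** Returns true when the follower should be reported as slow. */
  boolean isFollowerSlow(long leaderLastIndex, long followerMatchIndex) {
    // The rtt may look healthy while the follower keeps falling further behind;
    // the gap between the indices captures that directly.
    return leaderLastIndex - followerMatchIndex > maxAllowedIndexGap;
  }
}
{code}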



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (RATIS-2159) TestRaftWithSimulatedRpc could "fail to retain".

2024-09-15 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-2159:
-

 Summary: TestRaftWithSimulatedRpc could "fail to retain".
 Key: RATIS-2159
 URL: https://issues.apache.org/jira/browse/RATIS-2159
 Project: Ratis
  Issue Type: Bug
Reporter: Tsz-wo Sze


{code}
Error:  Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 42.31 s 
<<< FAILURE! - in org.apache.ratis.server.simulation.TestRaftWithSimulatedRpc
Error:  
org.apache.ratis.server.simulation.TestRaftWithSimulatedRpc.testWithLoad  Time 
elapsed: 8.47 s  <<< ERROR!
java.lang.IllegalStateException: Failed to retain: object has already been 
completely released.
at 
org.apache.ratis.util.ReferenceCountedLeakDetector$Impl.retain(ReferenceCountedLeakDetector.java:116)
at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.retainLog(SegmentedRaftLog.java:310)
at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:285)
at 
org.apache.ratis.RaftTestUtil.logEntriesContains(RaftTestUtil.java:187)
at 
org.apache.ratis.RaftTestUtil.logEntriesContains(RaftTestUtil.java:172)
at 
org.apache.ratis.RaftTestUtil.lambda$assertLogEntries$5(RaftTestUtil.java:250)
at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
...
at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:593)
at org.apache.ratis.RaftTestUtil.assertLogEntries(RaftTestUtil.java:251)
at org.apache.ratis.RaftTestUtil.assertLogEntries(RaftTestUtil.java:242)
at org.apache.ratis.RaftBasicTests.testWithLoad(RaftBasicTests.java:424)
at 
org.apache.ratis.RaftBasicTests.lambda$testWithLoad$8(RaftBasicTests.java:344)
at 
org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:143)
at 
org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:121)
at org.apache.ratis.RaftBasicTests.testWithLoad(RaftBasicTests.java:344)
...
{code}
See 
https://github.com/apache/ratis/actions/runs/10865610568/job/30154388685?pr=1150#step:5:741




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2127) TestRetryCacheWithGrpc may fail with object already completely released.

2024-09-15 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881861#comment-17881861
 ] 

Tsz-wo Sze commented on RATIS-2127:
---

TestRaftWithSimulatedRpc could also fail although the exception stack trace is 
different from here; filed RATIS-2159.

> TestRetryCacheWithGrpc may fail with object already completely released.
> 
>
> Key: RATIS-2127
> URL: https://issues.apache.org/jira/browse/RATIS-2127
> Project: Ratis
>  Issue Type: Sub-task
>  Components: gRPC
>Reporter: Tsz-wo Sze
>Assignee: Duong
>Priority: Blocker
>
> Found IllegalStateException: Failed to release: object has already been 
> completely released.
> {code}
> 2024-06-04 12:00:35,728 
> [s0@group-11B7B1EB32F8->s4-GrpcLogAppender-LogAppenderDaemon] WARN  
> leader.LogAppenderDaemon (LogAppenderDaemon.java:run(89)) - 
> s0@group-11B7B1EB32F8->s4-GrpcLogAppender-LogAppenderDaemon failed
> java.lang.IllegalStateException: Failed to release: object has already been 
> completely released.
>   at 
> org.apache.ratis.util.ReferenceCountedLeakDetector$Impl.release(ReferenceCountedLeakDetector.java:130)
>   at 
> org.apache.ratis.util.ReferenceCountedLeakDetector$SimpleTracing.release(ReferenceCountedLeakDetector.java:152)
>   at 
> org.apache.ratis.util.ReferenceCountedObject$3.release(ReferenceCountedObject.java:150)
>   at 
> org.apache.ratis.util.ReferenceCountedObject$2.release(ReferenceCountedObject.java:122)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:414)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:262)
>   at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>   at java.lang.Thread.run(Thread.java:750)
> {code}
> See [the 
> logs|https://issues.apache.org/jira/secure/attachment/13069289/TestRetryCacheWithGrpc.tar.gz].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2156) Notify follower slowness based on the log index

2024-09-13 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2156:
--
Component/s: Leader

Thanks for filing the JIRA and working on this!

> Notify follower slowness based on the log index
> ---
>
> Key: RATIS-2156
> URL: https://issues.apache.org/jira/browse/RATIS-2156
> Project: Ratis
>  Issue Type: Improvement
>  Components: Leader
>Reporter: Ivan Andika
>Assignee: Ivan Andika
>Priority: Major
> Attachments: image-2024-09-13-18-54-04-203.png
>
>
> Currently, StateMachine.LeaderEventApi#notifyFollowerSlowness is based on 
> raft.server.rpc.slowness.timeout. We saw that there are sometimes cases 
> where the rpc rtt between the leader and follower does not exceed the 
> timeout, yet the difference of the log index between the leader and follower 
> keeps increasing, i.e. the slow follower cannot catch up.
> In Ozone, this causes most watch requests with ALL_COMMITTED replication to 
> time out, causing increased latency of writes. It is better to close the 
> pipeline if the slow follower cannot catch up.
> !image-2024-09-13-18-54-04-203.png|width=1408,height=244!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (RATIS-2155) Add a builder for RatisShell

2024-09-12 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-2155:
-

 Summary: Add a builder for RatisShell
 Key: RATIS-2155
 URL: https://issues.apache.org/jira/browse/RATIS-2155
 Project: Ratis
  Issue Type: New Feature
  Components: shell
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


Currently, RatisShell is executed via CLI.  It will use the default 
RaftProperties and a null Parameters to build a RaftClient.  There is no way to 
pass TlsConf; as a result, RatisShell cannot access secure clusters.

This JIRA is to add a builder in order to pass RaftProperties and Parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2152) GrpcLogAppender gets stuck while sending an installSnapshot notification request

2024-09-12 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2152.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~wfps1210]!

> GrpcLogAppender gets stuck while sending an installSnapshot notification request
> 
>
> Key: RATIS-2152
> URL: https://issues.apache.org/jira/browse/RATIS-2152
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Chung En Lee
>Assignee: Chung En Lee
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In `GrpcLogAppender`, it waits for a signal at the end of 
> `notifyInstallSnapshot` as follows.
> [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831]
> However, the check of whether the `InstallSnapshotResponseHandler` is done and 
> the call to `AwaitForSignal.await()` are not atomic. This creates a potential 
> race condition where InstallSnapshotResponseHandler.close() could finish 
> after the check but before the wait, so that `GrpcLogAppender` may still be 
> waiting even though `InstallSnapshotResponseHandler` has already completed, 
> leading to a timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2152) GrpcLogAppender gets stuck while sending an installSnapshot notification request

2024-09-12 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2152:
--
Component/s: gRPC

> GrpcLogAppender gets stuck while sending an installSnapshot notification request
> 
>
> Key: RATIS-2152
> URL: https://issues.apache.org/jira/browse/RATIS-2152
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Chung En Lee
>Assignee: Chung En Lee
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> `GrpcLogAppender` waits for a signal at the end of `notifyInstallSnapshot` as 
> follows.
> [https://github.com/apache/ratis/blob/master/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L825-L831]
> However, checking whether the `InstallSnapshotResponseHandler` is done and 
> calling `AwaitForSignal.await()` are not atomic. This creates a potential 
> race condition where InstallSnapshotResponseHandler.close() could finish 
> after the check but before the wait, so that `GrpcLogAppender` keeps waiting 
> even though `InstallSnapshotResponseHandler` has already completed, leading 
> to a timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2154) The old leader may send appendEntries after term changed

2024-09-12 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2154.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~tohsakarin__]!

> The old leader may send appendEntries after term changed
> 
>
> Key: RATIS-2154
> URL: https://issues.apache.org/jira/browse/RATIS-2154
> Project: Ratis
>  Issue Type: Wish
>  Components: Leader
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-09-12-09-43-30-670.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The leader becomes a follower after receiving a higher term, but during this 
> process the old leader may still be appending log entries, and error logs 
> will be printed until the LogAppenderDaemon is closed.
> !image-2024-09-12-09-43-30-670.png!
>  
> I think we can call state.updateCurrentTerm(newTerm) later: close the 
> LeaderState first, then update the term, and keep the other operations 
> unchanged.
>  
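
A minimal sketch of the proposed ordering, with hypothetical names standing in for RaftServerImpl/LeaderStateImpl: shut the leader role down first, then bump the term.
{code:java}
// Sketch of the suggested step-down ordering (hypothetical names).
public final class StepDownOrdering {
  private volatile long currentTerm = 1;
  private volatile boolean leaderRunning = true;

  /** Called when a message with a higher term is received. */
  synchronized void changeToFollower(long newTerm) {
    // 1. Stop appending first, so no appendEntries is sent with the stale term.
    closeLeaderState();
    // 2. Only then update the term.
    if (newTerm > currentTerm) {
      currentTerm = newTerm;
    }
  }

  private void closeLeaderState() {
    leaderRunning = false;  // stand-in for LeaderState/LogAppenderDaemon shutdown
  }

  public static void main(String[] args) {
    StepDownOrdering s = new StepDownOrdering();
    s.changeToFollower(2);
    System.out.println("leaderRunning=" + s.leaderRunning + ", term=" + s.currentTerm);
  }
}
{code}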



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (RATIS-2154) The old leader may send appendEntries after term changed

2024-09-12 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2154:
-

Component/s: Leader
   Assignee: yuuka
Summary: The old leader may send appendEntries after term changed  
(was: Discussion on follow-up when deleting unnecessary error logs GC severe in 
JVMPauseMonitor)

> The old leader may send appendEntries after term changed
> 
>
> Key: RATIS-2154
> URL: https://issues.apache.org/jira/browse/RATIS-2154
> Project: Ratis
>  Issue Type: Wish
>  Components: Leader
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Attachments: image-2024-09-12-09-43-30-670.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The leader becomes a follower after receiving a higher term, but during this 
> process the old leader may still be appending log entries, and error logs 
> will be printed until the LogAppenderDaemon is closed.
> !image-2024-09-12-09-43-30-670.png!
>  
> I think we can call state.updateCurrentTerm(newTerm) later: close the 
> LeaderState first, then update the term, and keep the other operations 
> unchanged.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet

2024-09-06 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2149.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~tohsakarin__]!

> Do not perform leader election if the current RaftServer has not started yet
> 
>
> Key: RATIS-2149
> URL: https://issues.apache.org/jira/browse/RATIS-2149
> Project: Ratis
>  Issue Type: Improvement
>  Components: election
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-09-03-17-41-41-872.png, 
> image-2024-09-03-18-13-50-628.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Sometimes we cannot guarantee that the program will run normally in various 
> environments, and appropriate robustness enhancement may be necessary.
> Before adding members, RaftServer S and the corresponding group will be 
> created if the group does not exist. We found that the interval between 
> these two logs is more than one minute.
> !image-2024-09-03-17-41-41-872.png!
>  
> Since our RpcTimeout is smaller than 1 minute, the retryPolicy has already 
> started, but S's groupId is already in the implMaps of RaftServerProxy, which 
> will throw AlreadyExistException. When we catch this exception, we assume 
> that the creation has completed and that the member change can be executed.
>  
> However, S is still in the initializing state, so this member change will 
> not be completed. Finally, we found that S started an election, received a 
> NOT_IN_CONF reply, and was then closed.
> !image-2024-09-03-18-13-50-628.png!
>  
>  
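
A small sketch of the guard being proposed, assuming a hypothetical ServerState enum rather than the real Ratis lifecycle API: an election is attempted only when the server is fully started.
{code:java}
// Sketch of guarding the election against a server that is still initializing.
public final class ElectionGuard {
  enum ServerState { INITIALIZING, RUNNING, CLOSED }

  private volatile ServerState state = ServerState.INITIALIZING;

  void markStarted() { state = ServerState.RUNNING; }

  /** Only a fully started server may change to candidate and request votes. */
  boolean tryStartElection() {
    if (state != ServerState.RUNNING) {
      return false;  // skip: the server has not started yet
    }
    // ... changeToCandidate() and send (pre-)vote requests ...
    return true;
  }

  public static void main(String[] args) {
    ElectionGuard g = new ElectionGuard();
    System.out.println(g.tryStartElection()); // false while initializing
    g.markStarted();
    System.out.println(g.tryStartElection()); // true once running
  }
}
{code}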



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2148) Snapshot transfer may cause followers to trigger reloadStateMachine incorrectly

2024-09-06 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2148.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~tohsakarin__] !

> Snapshot transfer may cause followers to trigger reloadStateMachine 
> incorrectly
> ---
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-09-03-14-24-25-652.png, 
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Because grpc streaming snapshot sending sends all requests at once, error 
> handling is performed only after all of them are sent, and the last snapshot 
> request is used as a completion flag. The last request may therefore be 
> received successfully even though an earlier request has failed. The sender 
> handles the failure by retransmitting the snapshot, while the receiver 
> triggers state.reloadStateMachine because it successfully received the last 
> request, even though the snapshot reception is incomplete.
>  
> An md5 mismatch exception occurred before the last SnapshotRequest was 
> received
> !image-2024-09-03-14-27-39-406.png!
>  
> The last snapshot request arrived, then successfully received, and then 
> updated the index.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>  
> However, the snapshot reception is incomplete and triggers the 
> reloadStateMachine.
> !image-2024-09-03-14-33-49-573.png!
>  
> I suggest using a flag to identify whether the entire snapshot request is 
> abnormal.
> If an exception occurs, the subsequent content of the request will not be 
> processed.
> Or the sender will wait for the receiver's reply. If there is a release 
> error, resend it.
>  
> Finally, the current error retry level is the entire snapshot directory 
> rather than a single chunk, which will cause a large number of snapshot files 
> to be sent repeatedly, which can be optimized later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (RATIS-2151) TestRaftWithGrpc may fail after RATIS-2129

2024-09-06 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-2151:
-

 Summary: TestRaftWithGrpc may fail after RATIS-2129
 Key: RATIS-2151
 URL: https://issues.apache.org/jira/browse/RATIS-2151
 Project: Ratis
  Issue Type: Task
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze



- After RATIS-2129: 
TestRaftWithGrpc#[781d61d37411b374f104eb0806e1e2c4090fb35e]-10x10: 91/100 
failures
https://github.com/szetszwo/ratis/actions/runs/10747241634/job/29810232738

- Before RATIS-2129: 
TestRaftWithGrpc#[dfed1012983d1d7b5fb2c408e19b8661cbe000b4]-10x10 success
https://github.com/szetszwo/ratis/actions/runs/10746526581




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-09-05 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879663#comment-17879663
 ] 

Tsz-wo Sze commented on RATIS-2137:
---

[~lemony], are you going to build it yourself?  Or do you need a release based 
on 2.5.1?  Please feel free to share your thoughts.

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Assignee: Kevin Liu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-08-13-11-28-16-250.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it 
> succeeded in installing the snapshot but failed to updateLastAppliedTermIndex. 
> After that, it repeated 'receive installSnapshot and installSnapshot failed' 
> for several hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|c

[jira] [Updated] (RATIS-2150) No need for manual assembly:single execution when mvn deploy

2024-09-04 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2150:
--
Component/s: build

> No need for manual assembly:single execution when mvn deploy
> 
>
> Key: RATIS-2150
> URL: https://issues.apache.org/jira/browse/RATIS-2150
> Project: Ratis
>  Issue Type: Improvement
>  Components: build
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This [RATIS-2117|https://issues.apache.org/jira/browse/RATIS-2117] missed the 
> mvn deploy command update, which will be addressed in this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (RATIS-2147) MD5 mismatch when accepting a snapshot

2024-09-04 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2147:
-

Assignee: yuuka

> MD5 mismatch when accepting a snapshot
> -
>
> Key: RATIS-2147
> URL: https://issues.apache.org/jira/browse/RATIS-2147
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0, 3.2.0
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We encountered an MD5 mismatch issue in IoTDB, and after multiple 
> investigations, we found that the digester was contaminated
>  
> We have checked that it is not a network or disk problem
>  
> In implementation, the received snapshot will be written to a temporary file 
> first. If there is an md5 mismatch, we will read the data from this temporary 
> file and use a new digest to calculate md5, but the result of this 
> calculation is the same as the md5 hash value sent
> !image-2024-09-03-10-35-28-617.png!
>  
> !image-2024-09-03-10-35-08-315.png!
>  
>  
> Use the saved corrupted file name to locate the relevant log, here to 
> tlog.txt.snapshot.snapshot.as an example corrupt20240831-094107 _735
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA!
> Before encountering corrupt, the sender sent several consecutive snapshot 
> installation requests to the receiver.
>  
> The receiver successfully received some requests, and then encountered a 
> request for corrupt, and began printing "recompute again" to start 
> recalculating.
>  
> After execution, the ERROR log of the rename will be printed, and the data 
> will be read from the file and compared with the received chunk data.
>  
> If a byte does not match, the corresponding information will be printed, but 
> no log information will be printed, which means that the content written to 
> the disk is the same as the content sent
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA!
> This makes the problem very clear. There is a problem with the MD5 
> calculation class, and the reasons are as follows:
>  
>      If a byte in the middle of the data part is incorrect due to network 
> reasons, the calculated result and the hash sent must be different
>  
>     If there is a problem with the part that stores the hash value, the final 
> calculation result will also be different.
>  
> I suggest creating a new digest every time the follower receives a snapshot, 
> so as to avoid contamination problems. Under normal network and disk 
> conditions, corruption will not occur.
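
A minimal sketch of the suggestion, assuming plain java.security.MessageDigest usage rather than the actual Ratis snapshot code: create a fresh digest per snapshot so no state can leak between computations.
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class PerSnapshotDigest {
  /** Compute the MD5 of one snapshot's data with its own, freshly created digest. */
  static String md5Of(byte[] data) throws NoSuchAlgorithmException {
    final MessageDigest md5 = MessageDigest.getInstance("MD5"); // new instance per snapshot
    final StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest(data)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    final byte[] snapshot = "snapshot-bytes".getBytes(StandardCharsets.UTF_8);
    // Two independent computations agree; no state can leak from a previous snapshot.
    System.out.println(md5Of(snapshot).equals(md5Of(snapshot))); // true
  }
}
{code}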



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (RATIS-2149) Do not perform leader election if the current RaftServer has not started yet

2024-09-04 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2149:
-

Component/s: election
   Assignee: yuuka
 Issue Type: Improvement  (was: Wish)

> Do not perform leader election if the current RaftServer has not started yet
> 
>
> Key: RATIS-2149
> URL: https://issues.apache.org/jira/browse/RATIS-2149
> Project: Ratis
>  Issue Type: Improvement
>  Components: election
>Reporter: yuuka
>Assignee: yuuka
>Priority: Major
> Attachments: image-2024-09-03-17-41-41-872.png, 
> image-2024-09-03-18-13-50-628.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Sometimes we cannot guarantee that the program will run normally in various 
> environments, and appropriate robustness enhancement may be necessary.
> Before adding members, RaftServer S and the corresponding group will be 
> created if the group does not exist. We found that the interval between 
> these two logs is more than one minute.
> !image-2024-09-03-17-41-41-872.png!
>  
> Since our RpcTimeout is smaller than 1 minute, the retryPolicy has already 
> started, but S's groupId is already in the implMaps of RaftServerProxy, which 
> will throw AlreadyExistException. When we catch this exception, we assume 
> that the creation has completed and that the member change can be executed.
>  
> However, S is still in the initializing state, so this member change will 
> not be completed. Finally, we found that S started an election, received a 
> NOT_IN_CONF reply, and was then closed.
> !image-2024-09-03-18-13-50-628.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-09-04 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879295#comment-17879295
 ] 

Tsz-wo Sze commented on RATIS-2137:
---

For 2.5.1, let's also cherry-pick RATIS-1902.  I have just tried it; the code 
conflict is minor.

After that, this (RATIS-2137) can be cherry-picked cleanly.

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Assignee: Kevin Liu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-08-13-11-28-16-250.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it 
> succeeded in installing the snapshot but failed to updateLastAppliedTermIndex. 
> After that, it repeated 'receive installSnapshot and installSnapshot failed' 
> for several hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
> 

[jira] [Commented] (RATIS-2148) GRPC streaming snapshot transfer may cause followers to trigger reloadStateMachine incorrectly

2024-09-03 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878970#comment-17878970
 ] 

Tsz-wo Sze commented on RATIS-2148:
---

That's great, thanks!

> GRPC streaming snapshot transfer may cause followers to trigger 
> reloadStateMachine incorrectly
> --
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0, 3.2.0
>Reporter: yuuka
>Priority: Major
> Attachments: image-2024-09-03-14-24-25-652.png, 
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>
> Because grpc streaming snapshot sending sends all requests at once, error 
> handling is performed only after all of them are sent, and the last snapshot 
> request is used as a completion flag. The last request may therefore be 
> received successfully even though an earlier request has failed. The sender 
> handles the failure by retransmitting the snapshot, while the receiver 
> triggers state.reloadStateMachine because it successfully received the last 
> request, even though the snapshot reception is incomplete.
>  
> An md5 mismatch exception occurred before the last SnapshotRequest was 
> received
> !image-2024-09-03-14-27-39-406.png!
>  
> The last snapshot request arrived, then successfully received, and then 
> updated the index.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>  
> However, the snapshot reception is incomplete and triggers the 
> reloadStateMachine.
> !image-2024-09-03-14-33-49-573.png!
>  
> I suggest using a flag to identify whether the entire snapshot request is 
> abnormal.
> If an exception occurs, the subsequent content of the request will not be 
> processed.
> Or the sender will wait for the receiver's reply. If there is a release 
> error, resend it.
>  
> Finally, the current error retry level is the entire snapshot directory 
> rather than a single chunk, which will cause a large number of snapshot files 
> to be sent repeatedly, which can be optimized later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2148) GRPC streaming snapshot transfer may cause followers to trigger reloadStateMachine incorrectly

2024-09-03 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878962#comment-17878962
 ] 

Tsz-wo Sze commented on RATIS-2148:
---

https://github.com/apache/ratis/blob/8f5159db4cade67b96c7b9c8589e7c0cdba571e0/ratis-server/src/main/java/org/apache/ratis/server/impl/SnapshotInstallationHandler.java#L187-L191

{code}
//SnapshotInstallationHandler.java
// update the committed index
// re-load the state machine if this is the last chunk
if (snapshotChunkRequest.getDone()) {
  state.reloadStateMachine(lastIncluded);
}
{code}
You are right.  The SnapshotInstallationHandler code above should check if the 
request is completed successfully before calling reloadStateMachine(..).  Would 
you like to provide a pull request? 
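
A rough sketch of that check, with hypothetical names (the real fix would live in SnapshotInstallationHandler and its chunk handling): reload the state machine only when the last chunk arrives and no earlier chunk has failed.
{code:java}
// Sketch: only reload when the whole snapshot was received successfully.
public final class SnapshotInstallTracker {
  private boolean anyChunkFailed = false;

  void onChunk(boolean chunkOk, boolean isLastChunk, Runnable reloadStateMachine) {
    if (!chunkOk) {
      anyChunkFailed = true;  // e.g. an MD5 mismatch on an earlier request
    }
    if (isLastChunk && !anyChunkFailed) {
      reloadStateMachine.run();       // safe: the whole snapshot was received
    } else if (isLastChunk) {
      System.out.println("last chunk received but snapshot incomplete; not reloading");
    }
  }

  public static void main(String[] args) {
    SnapshotInstallTracker t = new SnapshotInstallTracker();
    t.onChunk(false, false, () -> {});                          // an earlier chunk failed
    t.onChunk(true, true, () -> System.out.println("reload"));  // last chunk ok, reload is skipped
  }
}
{code}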



> GRPC streaming snapshot transfer may cause followers to trigger 
> reloadStateMachine incorrectly
> --
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0, 3.2.0
>Reporter: yuuka
>Priority: Major
> Attachments: image-2024-09-03-14-24-25-652.png, 
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>
> Because grpc streaming snapshot sending sends all requests at once, error 
> handling is performed only after all of them are sent, and the last snapshot 
> request is used as a completion flag. The last request may therefore be 
> received successfully even though an earlier request has failed. The sender 
> handles the failure by retransmitting the snapshot, while the receiver 
> triggers state.reloadStateMachine because it successfully received the last 
> request, even though the snapshot reception is incomplete.
>  
> An md5 mismatch exception occurred before the last SnapshotRequest was 
> received
> !image-2024-09-03-14-27-39-406.png!
>  
> The last snapshot request arrived, then successfully received, and then 
> updated the index.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>  
> However, the snapshot reception is incomplete and triggers the 
> reloadStateMachine.
> !image-2024-09-03-14-33-49-573.png!
>  
> I suggest using a flag to identify whether the entire snapshot request is 
> abnormal.
> If an exception occurs, the subsequent content of the request will not be 
> processed.
> Or the sender will wait for the receiver's reply. If there is a release 
> error, resend it.
>  
> Finally, the current error retry level is the entire snapshot directory 
> rather than a single chunk, which will cause a large number of snapshot files 
> to be sent repeatedly, which can be optimized later



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2147) MD5 mismatch when accepting a snapshot

2024-09-02 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878701#comment-17878701
 ] 

Tsz-wo Sze commented on RATIS-2147:
---

[~tohsakarin__], thanks for reporting the problem!  Would you like to submit a 
pull request?

> MD5 mismatch when accepting a snapshot
> --
>
> Key: RATIS-2147
> URL: https://issues.apache.org/jira/browse/RATIS-2147
> Project: Ratis
>  Issue Type: Bug
>  Components: snapshot
>Affects Versions: 3.1.0, 3.2.0
>Reporter: yuuka
>Priority: Major
> Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>
> We encountered an MD5 mismatch issue in IoTDB, and after multiple 
> investigations, we found that the digester was contaminated
>  
> We have checked that it is not a network or disk problem
>  
> In implementation, the received snapshot will be written to a temporary file 
> first. If there is an md5 mismatch, we will read the data from this temporary 
> file and use a new digest to calculate md5, but the result of this 
> calculation is the same as the md5 hash value sent
> !image-2024-09-03-10-35-28-617.png!
>  
> !image-2024-09-03-10-35-08-315.png!
>  
>  
> Use the saved corrupted file name to locate the relevant log, here to 
> tlog.txt.snapshot.snapshot.as an example corrupt20240831-094107 _735
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YjM4MWY1MTA2Y2EyYWU4MmZlNDE0Mzg3MDRjYTBjMjRfU0dPbEpVbWFNalV1V1lSUVllOGFISUdWbUhqanRFdFdfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA!
> Before encountering corrupt, the sender sent several consecutive snapshot 
> installation requests to the receiver.
>  
> The receiver successfully received some requests, and then encountered a 
> request for corrupt, and began printing "recompute again" to start 
> recalculating.
>  
> After execution, the ERROR log of the rename will be printed, and the data 
> will be read from the file and compared with the received chunk data.
>  
> If a byte does not match, the corresponding information will be printed, but 
> no log information will be printed, which means that the content written to 
> the disk is the same as the content sent
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YmZlYjk1YjAwOWE4MDJlYTEzZjkxMjljODU1MzQxMTZfMkU0NmlPRWpidDBweGNzWXY4cHNJZG14b1o3Z1BZMzhfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA!
> This makes the problem very clear. There is a problem with the MD5 
> calculation class, and the reasons are as follows:
>  
>      If a byte in the middle of the data part is incorrect due to network 
> reasons, the calculated result and the hash sent must be different
>  
>     If there is a problem with the part that stores the hash value, the final 
> calculation result will also be different.
>  
> I suggest creating a new digest every time the follower receives a snapshot, 
> so as to avoid contamination problems. Under normal network and disk 
> conditions, corruption will not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes

2024-09-02 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2146.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~tanxinyu]!

> Fixed possible issues caused by concurrent deletion and election when member 
> changes
> 
>
> Key: RATIS-2146
> URL: https://issues.apache.org/jira/browse/RATIS-2146
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-08-28-14-53-23-259.png, 
> image-2024-08-28-14-53-27-637.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> During this process, we encountered some concurrency issues:
> * After the member change is complete, node D will no longer be a member of 
> this consensus group. It will attempt to initiate an election but receive a 
> NOT_IN_CONF response, after which it will close itself.
> * During the removal of member D, it will also close itself first, and then 
> proceed to delete the file directory.
> These two CLOSE operations may occur concurrently, which could result in the 
> directory being deleted while the StateMachineUpdater thread has not yet 
> closed, ultimately leading to unexpected errors.
>  !image-2024-08-28-14-53-23-259.png! 
>  !image-2024-08-28-14-53-27-637.png! 
> I believe there are two possible solutions for this issue:
> * Add concurrency control to the close function, such as adding the 
> synchronized keyword to the function.
> * Add some checks before deleting the directory to ensure that the callback 
> functions in the close process have already been executed before the 
> directory is deleted.
> What's your opinion? [~szetszwo]
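
A small sketch combining both ideas, with hypothetical names (not the actual RaftServerImpl/StateMachineUpdater code): an idempotent synchronized close(), and a removal path that waits for close() to finish before deleting the directory.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

public final class SafeRemove {
  private final CountDownLatch closed = new CountDownLatch(1);
  private boolean closeCalled = false;

  /** Idempotent, synchronized close; concurrent callers cannot interleave. */
  public synchronized void close() {
    if (closeCalled) {
      return;
    }
    closeCalled = true;
    // ... stop the StateMachineUpdater, log appenders, etc. ...
    closed.countDown();
  }

  /** Delete the group directory only after close() has fully completed. */
  public void removeGroup(Path groupDir) throws IOException, InterruptedException {
    close();
    closed.await();                 // wait for the close callbacks to finish
    Files.deleteIfExists(groupDir); // now it is safe to delete the directory
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("raft-group");
    new SafeRemove().removeGroup(dir);
    System.out.println("deleted: " + !Files.exists(dir));
  }
}
{code}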



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2146) Fixed possible issues caused by concurrent deletion and election when member changes

2024-09-02 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2146:
--
Component/s: server

> Fixed possible issues caused by concurrent deletion and election when member 
> changes
> 
>
> Key: RATIS-2146
> URL: https://issues.apache.org/jira/browse/RATIS-2146
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Attachments: image-2024-08-28-14-53-23-259.png, 
> image-2024-08-28-14-53-27-637.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> During this process, we encountered some concurrency issues:
> * After the member change is complete, node D will no longer be a member of 
> this consensus group. It will attempt to initiate an election but receive a 
> NOT_IN_CONF response, after which it will close itself.
> * During the removal of member D, it will also close itself first, and then 
> proceed to delete the file directory.
> These two CLOSE operations may occur concurrently, which could result in the 
> directory being deleted while the StateMachineUpdater thread has not yet 
> closed, ultimately leading to unexpected errors.
>  !image-2024-08-28-14-53-23-259.png! 
>  !image-2024-08-28-14-53-27-637.png! 
> I believe there are two possible solutions for this issue:
> * Add concurrency control to the close function, such as adding the 
> synchronized keyword to the function.
> * Add some checks before deleting the directory to ensure that the callback 
> functions in the close process have already been executed before the 
> directory is deleted.
> What's your opinion? [~szetszwo]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2129) Low replication performance because of lock contention on RaftLog

2024-08-30 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2129.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request [#1141|https://github.com/apache/ratis/pull/1141] is now 
merged to the master branch.

> Low replication performance because of lock contention on RaftLog
> -
>
> Key: RATIS-2129
> URL: https://issues.apache.org/jira/browse/RATIS-2129
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Assignee: Tsz-wo Sze
>Priority: Blocker
>  Labels: Performance, performance
> Fix For: 3.2.0
>
> Attachments: Screenshot 2024-07-22 at 4.40.07 PM-1.png, Screenshot 
> 2024-07-22 at 4.40.07 PM.png, dn_echo_leader_profile.html, 
> image-2024-07-22-15-25-46-155.png, ratis_ratfLog_lock_contention.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Today, the GrpcLogAppender thread makes a lot of calls that need RaftLog's 
> readLock. In an active environment, RaftLog is always busy appending 
> transactions from clients, thus writeLock is frequently busy. This makes the 
> replication performance slow. 
> See the [^dn_echo_leader_profile.html], or in the picture below, the purple 
> is the time taken to acquire readLock from RaftLog.
>  # !image-2024-07-22-15-25-46-155.png|width=854,height=425!
> h2. A summary of LockContention in Ratis. 
> h2.  
> !ratis_ratfLog_lock_contention.png|width=392,height=380!
> Today, RaftLog consistency is protected by a global ReadWriteLock. (global 
> means RaftLog has a single ReadWriteLock and the lock is acquired at the 
> scope of the RaftLog instance, or a RaftGroup).
> In a RaftGroup, the following actors race to obtain this global ReadWriteLock 
> in the leader node:
>  * The writer, which is the GRPC Client Service, accepts transaction 
> submissions from Raft clients and appends transactions (or log entries) to 
> RaftLog. Each append operation needs to acquire the writeLock from RaftLog to 
> put the transaction to RaftLog's memory queue. Although each of these append 
> operations is quick, Ratis is designed to maximize transactions append and so 
> the writeLock should be always busy.
>  * StateMachineUpdater. For each transaction, when it is acknowledged by 
> enough followers, this single thread actor will read the log from RaftLog and 
> call StateMachine to apply the transaction. This actor acquires readLock from 
> RaftLog for each log entry read. 
>  * GrpcLogAppender: for each follower, there's a thread of GrpcLogAppender 
> that constantly reads log entries from RaftLog and replicates them to the 
> follower. This thread acquires readLock from RaftLog every time it reads a 
> log entry.
> The writer, StateMachineUpdater, and GrpcLogAppender are all designed to 
> maximize their throughput. For instance, StateMachineUpdater invokes 
> StateMachine's applyTransaction as asynchronous calls; the same goes for the 
> way GrpcLogAppender replicates log entries to the followers.
> The global ReadWriteLock *creates a tough contention* between the RaftLog 
> writers and readers, and that is what limits the Ratis throughput. The faster 
> the writers and readers are, the more they block each other.
>  
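
For illustration only, the toy example below reproduces the contention pattern described above: a writer taking the writeLock per append while a reader takes the readLock per entry, all on one global ReentrantReadWriteLock (this is not the SegmentedRaftLog implementation).
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public final class GlobalLogLockDemo {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> entries = new ArrayList<>();

  void append(String entry) {                // GRPC client service path
    lock.writeLock().lock();
    try { entries.add(entry); } finally { lock.writeLock().unlock(); }
  }

  String read(int index) {                   // GrpcLogAppender / StateMachineUpdater path
    lock.readLock().lock();
    try { return index < entries.size() ? entries.get(index) : null; }
    finally { lock.readLock().unlock(); }
  }

  public static void main(String[] args) throws InterruptedException {
    GlobalLogLockDemo log = new GlobalLogLockDemo();
    Thread writer = new Thread(() -> { for (int i = 0; i < 100_000; i++) log.append("e" + i); });
    Thread reader = new Thread(() -> { for (int i = 0; i < 100_000; i++) log.read(i); });
    writer.start(); reader.start();
    writer.join(); reader.join();
    System.out.println("last entry = " + log.read(99_999));
  }
}
{code}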



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2145) Follower hangs until the next trigger to take a snapshot

2024-08-30 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2145.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~z-bb]!

>  Follower hangs until the next trigger to take a snapshot
> -
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We discovered a problem when writing tests with high concurrency. It often 
> happens that a follower is running well and then hangs until the next 
> takeSnapshot is triggered.
> The following is the relevant log.
> follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and 
> 2024/08/22 20:21:57,058 no other logs appeared on the follower, but no 
> follower election was triggered, indicating that the heartbeats sent by the 
> leader to the follower were successful)
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO 
> org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment 
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax 
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO 
> com.xxx.RaftJournalManager: Received install snapshot notification from 
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO 
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from 
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, 
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/22 20:21:57,067 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed 
> appendEntries as snapshot (22441477801) installation is in progress
> 2024/08/22 20:21:57,068 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code}
> leader:
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, 
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, 
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, 
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, 
> i:22441098623)
> 2024/08/22 20:18:16,965 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192, 
> i:22441098627)
> 2024/08/22 20:18:16,965 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appe

[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot

2024-08-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877751#comment-17877751
 ] 

Tsz-wo Sze commented on RATIS-2145:
---

bq. What's the problem if it's not the lowest index? can't we ensure this 
problem through follower's checkInconsistentAppendEntries?

You are right that checkInconsistentAppendEntries ensures follower log entries 
are in the correct order.  It also seems okay if it is not the lowest index: the 
leader will just miss that reply, so it won't update the indices.  As long as 
the leader gets a SUCCESS reply, it knows that all the previous entries are 
correct.

That's a good idea!  Could you submit a pull request?
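
A rough sketch of the bookkeeping idea discussed here, using hypothetical names rather than the actual GrpcLogAppender pending-request map: drop the timed-out request, and let a later SUCCESS reply advance the indices.
{code:java}
import java.util.concurrent.ConcurrentSkipListMap;

public final class PendingAppendRequests {
  // keyed by the first log index of each outstanding appendEntries request
  private final ConcurrentSkipListMap<Long, String> pending = new ConcurrentSkipListMap<>();

  void onSend(long firstIndex, String requestId) { pending.put(firstIndex, requestId); }

  /** Timeout handling: simply remove the request instead of failing the whole stream. */
  void onTimeout(long firstIndex) { pending.remove(firstIndex); }

  /** A SUCCESS reply for an index implies all earlier entries matched as well. */
  void onSuccess(long firstIndex) { pending.headMap(firstIndex, true).clear(); }

  public static void main(String[] args) {
    PendingAppendRequests p = new PendingAppendRequests();
    p.onSend(100, "r1");
    p.onSend(125, "r2");
    p.onTimeout(100);    // dropped; the leader just misses this one reply
    p.onSuccess(125);    // a later SUCCESS still advances past index 125
    System.out.println("outstanding=" + p.pending.size()); // 0
  }
}
{code}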

>  Follower hangs until the next trigger to take a snapshot
> -
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We discovered a problem when writing tests with high concurrency. It often 
> happens that a follower is running well and then hangs until the next 
> takeSnapshot is triggered.
> The following is the relevant log.
> follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and 
> 2024/08/22 20:21:57,058 no other logs appeared on the follower, but no 
> follower election was triggered, indicating that the heartbeats sent by the 
> leader to the follower were successful)
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO 
> org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment 
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax 
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO 
> com.xxx.RaftJournalManager: Received install snapshot notification from 
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO 
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from 
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, 
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/22 20:21:57,067 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed 
> appendEntries as snapshot (22441477801) installation is in progress
> 2024/08/22 20:21:57,068 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code}
> leader:
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, 
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, 
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, 
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, 
> i:22441098623)
> 2

[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot

2024-08-28 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877570#comment-17877570
 ] 

Tsz-wo Sze commented on RATIS-2145:
---

Hi [~z-bb], thanks a lot for filing this bug!  Some questions/comments:

bq. Because the leader did not receive the onNext callback within the 
requestTimeoutDuration(3s)time ...

Why?  GC?  Network issues?

It may need a longer timeout.  Apache Ozone uses 30s timeout by default.

bq. ... Is it possible to replace handleTimeout with the remove method here, ...

I guess you mean to remove the request from pendingRequests when it times out?  
It probably is okay if it is the request with the lowest index in 
pendingRequests.



>  Follower hangs until the next trigger to take a snapshot
> -
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We discovered a problem when writing tests with high concurrency. It often 
> happens that a follower is running well and then hangs until the next 
> takeSnapshot is triggered.
> The following is the relevant log.
> follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and 
> 2024/08/22 20:21:57,058 no other logs appeared on the follower, but no 
> follower election was triggered, indicating that the heartbeats sent by the 
> leader to the follower were successful)
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO 
> org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment 
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax 
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO 
> com.xxx.RaftJournalManager: Received install snapshot notification from 
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO 
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from 
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, 
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/22 20:21:57,067 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed 
> appendEntries as snapshot (22441477801) installation is in progress
> 2024/08/22 20:21:57,068 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code}
> leader:
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, 
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, 
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, 
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,ent

[jira] [Commented] (HDDS-11382) Remove the use of caniuse-lite

2024-08-28 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877483#comment-17877483
 ] 

Tsz-wo Sze commented on HDDS-11382:
---

Note that caniuse-api (MIT) is okay to use but caniuse-lite (CC 4.0) is not. 

* caniuse-lite (CC 4.0); see LEGAL-678
** https://github.com/browserslist/caniuse-lite/blob/main/LICENSE
* caniuse-api (MIT)
** https://github.com/Nyalab/caniuse-api/blob/master/LICENSE


> Remove the use of caniuse-lite
> --
>
> Key: HDDS-11382
> URL: https://issues.apache.org/jira/browse/HDDS-11382
> Project: Apache Ozone
>  Issue Type: Task
>  Components: Ozone Recon
>Reporter: Tsz-wo Sze
>Priority: Blocker
>
> After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite
> - 
> ./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt
> {code}
> 4491: The following software may be included in this product: caniuse-lite. A 
> copy of the source code may be downloaded from 
> https://github.com/ben-eb/caniuse-lite.git. This software contains the 
> following license and notice below:
>1 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE
> {code}
> 4491: The following software may be included in this product: caniuse-lite. A 
> copy of the source code may be downloaded from 
> https://github.com/ben-eb/caniuse-lite.git. This software contains the 
> following license and notice below:
>1 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json
> {code}
> 2043: "caniuse-lite": "^1.0.30001208",
> 2140: "caniuse-lite": {
> 2142:   "resolved": 
> "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
>3 occurrence(s)
> {code}
> - 
> ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encodin

[jira] [Updated] (HDDS-11382) Remove the use of caniuse-lite

2024-08-28 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-11382:
--
Description: 
After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite

- 
./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}
- ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- ./hadoop-ozone/recon/target/classes/webapps/recon/ozone-recon-web/LICENSE --
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}

Totally 30 occurrence(s) in 305545 file(s).


  was:
After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite

- 
./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone

[jira] [Commented] (HDDS-11368) Remove babel dependencies from Recon

2024-08-28 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877481#comment-17877481
 ] 

Tsz-wo Sze commented on HDDS-11368:
---

[~abhishek.pal], thanks a lot for fixing this!

After this, we still have caniuse-lite; filed HDDS-11382.

> Remove babel dependencies from Recon
> 
>
> Key: HDDS-11368
> URL: https://issues.apache.org/jira/browse/HDDS-11368
> Project: Apache Ozone
>  Issue Type: Task
>  Components: Ozone Recon
>Affects Versions: 1.4.0
>Reporter: Abhishek Pal
>Assignee: Abhishek Pal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.0
>
>
> *caniuse-lite* is currently being imported as a part of babel, which is 
> internally used by vitejs/plugin-react.
> Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be 
> used in our projects.
> This JIRA is to track the removal of the dependency from Recon



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Created] (HDDS-11382) Remove the use of caniuse-lite

2024-08-28 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDDS-11382:
-

 Summary: Remove the use of caniuse-lite
 Key: HDDS-11382
 URL: https://issues.apache.org/jira/browse/HDDS-11382
 Project: Apache Ozone
  Issue Type: Task
  Components: Ozone Recon
Reporter: Tsz-wo Sze


After HDDS-11368 is merged, we still have 30 occurrences of caniuse-lite

- 
./hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/licenses/LICENSE-ozone-recon.txt
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}
- ./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/LICENSE
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/@mswjs+interceptors@0.17.10/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/cjs-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/esm-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- 
./hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/node_modules/.pnpm/web-encoding@1.1.5/node_modules/web-encoding/test/ts-use/package-lock.json
{code}
2043: "caniuse-lite": "^1.0.30001208",
2140: "caniuse-lite": {
2142:   "resolved": 
"https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001214.tgz";,
   3 occurrence(s)
{code}
- ./hadoop-ozone/recon/target/classes/webapps/recon/ozone-recon-web/LICENSE --
{code}
4491: The following software may be included in this product: caniuse-lite. A 
copy of the source code may be downloaded from 
https://github.com/ben-eb/caniuse-lite.git. This software contains the 
following license and notice below:
   1 occurrence(s)
{code}

Totally 30 occurrence(s) in 305545 file(s).




--
This messag

[jira] [Assigned] (HDDS-11375) DN Startup fails with Illegal configuration

2024-08-28 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned HDDS-11375:
-

Assignee: Wei-Chiu Chuang  (was: Tsz-wo Sze)

[~weichiu], thanks a lot for fixing this!
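
For reference on the configuration error quoted below: the GrpcService check 
requires raft.grpc.message.size.max to be at least 1 MB larger than 
raft.server.log.appender.buffer.byte-limit. A minimal sketch, assuming plain 
RaftProperties string keys copied from the error message (not necessarily how 
Ozone wires these settings):
{code:java}
import org.apache.ratis.conf.RaftProperties;

public final class RatisGrpcSizeConfig {
  private RatisGrpcSizeConfig() { }

  // Sketch only: the property names are copied verbatim from the error above.
  public static RaftProperties newProperties() {
    final RaftProperties properties = new RaftProperties();
    properties.set("raft.server.log.appender.buffer.byte-limit", "32MB");
    // Must be at least 1 MB larger than the appender buffer limit,
    // otherwise the GrpcService constructor rejects the configuration.
    properties.set("raft.grpc.message.size.max", "33MB");
    return properties;
  }
}
{code}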

> DN Startup fails with Illegal configuration
> ---
>
> Key: HDDS-11375
> URL: https://issues.apache.org/jira/browse/HDDS-11375
> Project: Apache Ozone
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Pratyush Bhatt
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.0
>
>
> This is a problem if Ozone is upgraded to the latest (unreleased) Ratis code 
> base.  Ozone is currently using Ratis 3.1.0, which does not have this problem.
> All Ozone DN startup is failing with below error:
> {code:java}
> 2024-08-27 15:54:46,040 ERROR 
> [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in 
> HddsDatanodeService.
> java.lang.RuntimeException: Can't start the HDDS datanode plugin
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>     at picocli.CommandLine.access$1500(CommandLine.java:148)
>     at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>     at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>     at picocli.CommandLine.execute(CommandLine.java:2174)
>     at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>     at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159)
> Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal 
> configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m 
> (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
>     at 
> org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196)
>     at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.<init>(XceiverServerRatis.java:214)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533)
>     at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:209)
>     at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:183)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291)
>     ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal configuration: 
> raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger 
> than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:184)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:152)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:57)
>     at 
> org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111)
>     at 
> org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133)
>     at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.<init>(RaftServerProxy.java:212)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74)
>     at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>     at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Nati

[jira] [Commented] (RATIS-2145) Follower hangs until the next trigger to take a snapshot

2024-08-27 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877227#comment-17877227
 ] 

Tsz-wo Sze commented on RATIS-2145:
---

Sure, will review this tomorrow.

>  Follower hangs until the next trigger to take a snapshot
> -
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We discovered a problem when writing tests with high concurrency. It often 
> happens that a follower is running well and then triggers takeSnapshot.
> The following is the relevant log.
> follower: (as the follower log shows, between 2024/08/22 20:18:14,044 and 
> 2024/08/22 20:21:57,058, no other logs appeared on the follower, yet a 
> follower election was not triggered, indicating that the heartbeats sent by 
> the leader to the follower were succeeding)
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO 
> org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment 
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] 
> INFO org.apache.ratis.server.raftlog.RaftLog: 
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax 
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO 
> com.xxx.RaftJournalManager: Received install snapshot notification from 
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO 
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from 
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, 
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/22 20:21:57,067 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: Failed 
> appendEntries as snapshot (22441477801) installation is in progress
> 2024/08/22 20:21:57,068 [node1-server-thread55] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node3<-node1#19406445:FAIL-t192,INCONSISTENCY,nextIndex=22441098642,followerCommit=22441098595,matchIndex=-1{code}
> leader:
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192, 
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192, 
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192, 
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192, 
> i:22441098623)
> 2024/08/22 20:18:16,965 [timer3] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192, 
> i:22441098627)
> 2024/08/22 20:18:16,965 [timer7] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=Appe

[jira] [Assigned] (HDDS-11375) DN Startup fails with Illegal configuration

2024-08-27 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned HDDS-11375:
-

   Assignee: Tsz-wo Sze
Description: 
This is a problem if Ozone is upgraded to the latest (unreleased) Ratis code 
base.  Ozone is currently using Ratis 3.1.0, which does not have this problem.

All Ozone DN startup is failing with below error:
{code:java}
2024-08-27 15:54:46,040 ERROR 
[main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in 
HddsDatanodeService.
java.lang.RuntimeException: Can't start the HDDS datanode plugin
    at 
org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336)
    at 
org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209)
    at 
org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177)
    at 
org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95)
    at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
    at picocli.CommandLine.access$1500(CommandLine.java:148)
    at 
picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
    at 
picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
    at picocli.CommandLine.execute(CommandLine.java:2174)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
    at 
org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159)
Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal 
configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m 
(=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432).
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
    at 
org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196)
    at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210)
    at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.<init>(XceiverServerRatis.java:214)
    at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533)
    at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:209)
    at 
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:183)
    at 
org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Illegal configuration: 
raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger 
than raft.server.log.appender.buffer.byte-limit(= 33554432).
    at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:184)
    at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:152)
    at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:57)
    at 
org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111)
    at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133)
    at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40)
    at 
org.apache.ratis.server.impl.RaftServerProxy.<init>(RaftServerProxy.java:212)
    at 
org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74)
    at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212)
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225)
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212)
    at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204)
    at 
org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73)
    at 
org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
    at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at 
org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:191)
    ... 20 more
2024-08-27 15:54:46,045 INFO 
[shutdown-hook-0]-org.apache.hadoop.ozone.HddsDatanodeService: SHUTDOWN_MSG: 
/
SHUTDOWN_MSG: Shutting down HddsDatanodeService at host/xx.xx.xx.xx{code}

  was:
All Ozone DN startup is failing with below error:
{code:java}
2024-08-27 15:54:46,040 ERROR 
[main]-org.apache.hadoop.oz

[jira] [Commented] (HDDS-11375) DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"

2024-08-27 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877112#comment-17877112
 ] 

Tsz-wo Sze commented on HDDS-11375:
---

[~pratyush.bhatt], could provide the commit id of your code base?

> DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"
> --
>
> Key: HDDS-11375
> URL: https://issues.apache.org/jira/browse/HDDS-11375
> Project: Apache Ozone
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Pratyush Bhatt
>Priority: Major
>
> All Ozone DN startup is failing with below error:
> {code:java}
> 2024-08-27 15:54:46,040 ERROR 
> [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in 
> HddsDatanodeService.
> java.lang.RuntimeException: Can't start the HDDS datanode plugin
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>     at picocli.CommandLine.access$1500(CommandLine.java:148)
>     at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>     at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>     at picocli.CommandLine.execute(CommandLine.java:2174)
>     at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>     at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159)
> Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal 
> configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m 
> (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
>     at 
> org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196)
>     at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.<init>(XceiverServerRatis.java:214)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533)
>     at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:209)
>     at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:183)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291)
>     ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal configuration: 
> raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger 
> than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:184)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:152)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:57)
>     at 
> org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111)
>     at 
> org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133)
>     at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.<init>(RaftServerProxy.java:212)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74)
>     at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>     at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.refle

[jira] [Comment Edited] (HDDS-11375) DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"

2024-08-27 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877112#comment-17877112
 ] 

Tsz-wo Sze edited comment on HDDS-11375 at 8/27/24 5:27 PM:


[~pratyush.bhatt], could you provide the commit id of your code base?  I will 
take a look.


was (Author: szetszwo):
[~pratyush.bhatt], could provide the commit id of your code base?

> DN Startup fails with "RuntimeException: Can't start the HDDS datanode plugin"
> --
>
> Key: HDDS-11375
> URL: https://issues.apache.org/jira/browse/HDDS-11375
> Project: Apache Ozone
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Pratyush Bhatt
>Priority: Major
>
> All Ozone DN startup is failing with below error:
> {code:java}
> 2024-08-27 15:54:46,040 ERROR 
> [main]-org.apache.hadoop.ozone.HddsDatanodeService: Exception in 
> HddsDatanodeService.
> java.lang.RuntimeException: Can't start the HDDS datanode plugin
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:336)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:209)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:177)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:95)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>     at picocli.CommandLine.access$1500(CommandLine.java:148)
>     at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>     at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>     at picocli.CommandLine.execute(CommandLine.java:2174)
>     at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>     at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:159)
> Caused by: java.io.IOException: java.lang.IllegalArgumentException: Illegal 
> configuration: raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m 
> (=1048576) larger than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
>     at 
> org.apache.ratis.server.RaftServer$Builder.newRaftServer(RaftServer.java:196)
>     at org.apache.ratis.server.RaftServer$Builder.build(RaftServer.java:210)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.<init>(XceiverServerRatis.java:214)
>     at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.newXceiverServerRatis(XceiverServerRatis.java:533)
>     at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:209)
>     at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:183)
>     at 
> org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:291)
>     ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal configuration: 
> raft.grpc.message.size.max(= 32MB (=33554432)) must be 1m (=1048576) larger 
> than raft.server.log.appender.buffer.byte-limit(= 33554432).
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:184)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:152)
>     at org.apache.ratis.grpc.server.GrpcService.<init>(GrpcService.java:57)
>     at 
> org.apache.ratis.grpc.server.GrpcService$Builder.build(GrpcService.java:111)
>     at 
> org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:133)
>     at org.apache.ratis.grpc.GrpcFactory.newRaftServerRpc(GrpcFactory.java:40)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.<init>(RaftServerProxy.java:212)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.lambda$newRaftServer$0(ServerImplUtils.java:74)
>     at org.apache.ratis.util.JavaUtils.lambda$attempt$7(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:225)
>     at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:212)
>     at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:204)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:73)
>     at 
> org.apache.ratis.server.impl.ServerImplUtils.newRaftServer(ServerImplUtils.java:61)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>     at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccess

[jira] [Updated] (HDDS-11368) Remove babel dependencies from Recon

2024-08-26 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-11368:
--
   Fix Version/s: (was: 1.5.0)
  (was: 1.4.1)
  (was: 2.0.0)
Target Version/s: 1.5.0, 1.4.1, 2.0.0

> Remove babel dependencies from Recon
> 
>
> Key: HDDS-11368
> URL: https://issues.apache.org/jira/browse/HDDS-11368
> Project: Apache Ozone
>  Issue Type: Task
>  Components: Ozone Recon
>Affects Versions: 1.4.0
>Reporter: Abhishek Pal
>Assignee: Abhishek Pal
>Priority: Blocker
>  Labels: pull-request-available
>
> *caniuse-lite* is currently being imported as a part of babel, which is 
> internally used by vitejs/plugin-react.
> Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be 
> used in our projects.
> This JIRA is to track the removal of the dependency from Recon



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Resolved] (RATIS-2132) Revert RATIS-2099 due to its performance regression

2024-08-26 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2132.
---
Resolution: Done

Reverted RATIS-2099 and RATIS-2101.  Thanks, [~duongnguyen] for reporting this!
{code}
commit 520ecab157c9c6aff87ed5c5978c08f98cd4ec6c (origin/master, origin/HEAD, 
master)
Author: Tsz-Wo Nicholas Sze 
Date:   Mon Aug 26 11:07:29 2024 -0700

Revert "RATIS-2099. Cache TermIndexImpl instead of using anonymous class 
(#1100)"

This reverts commit 428ce4ae3d5a0349f3425cb85ef1a3d38dea24b1.
{code}
{code}
commit da5d508caffc4ca90b0ab962b5105785b9774daa
Author: Tsz-Wo Nicholas Sze 
Date:   Mon Aug 26 11:07:17 2024 -0700

Revert "RATIS-2101. Move TermIndex.PRIVATE_CACHE to Util.CACHE (#1103)"

This reverts commit 93eb32a8620fdd4e5119592ef32bc50590810c7b.
{code}


> Revert RATIS-2099 due to its performance regression
> ---
>
> Key: RATIS-2132
> URL: https://issues.apache.org/jira/browse/RATIS-2132
> Project: Ratis
>  Issue Type: Sub-task
>Reporter: Duong
>Assignee: Duong
>Priority: Major
> Attachments: Screenshot 2024-07-29 at 5.07.32 PM.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This commit creates a significant extra cost in the critical path (which is 
> run sequentially) of Ratis appendTransaction. 
> !Screenshot 2024-07-29 at 5.07.32 PM.png|width=981,height=479!
> This seems to be a premature optimization. One or two instances of TermIndex 
> per request are basically nothing (unless we create hundreds/thousands of 
> them per request).  Short-lived POJOs like this are best handled by the Java 
> GC/heap.
> More details are in the parent Jira RATIS-2129.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HDDS-11368) Remove babel dependencies from Recon

2024-08-26 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-11368:
--
Fix Version/s: 1.5.0
   1.4.1
   2.0.0
Affects Version/s: (was: 1.5.0)
 Priority: Blocker  (was: Critical)

Setting this as blocker of 1.4.1 and 2.0.0

> Remove babel dependencies from Recon
> 
>
> Key: HDDS-11368
> URL: https://issues.apache.org/jira/browse/HDDS-11368
> Project: Apache Ozone
>  Issue Type: Task
>  Components: Ozone Recon
>Affects Versions: 1.4.0
>Reporter: Abhishek Pal
>Assignee: Abhishek Pal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.0, 1.4.1, 2.0.0
>
>
> *caniuse-lite* is currently being imported as a part of babel, which is 
> internally used by vitejs/plugin-react.
> Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be 
> used in our projects.
> This JIRA is to track the removal of the dependency from Recon



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Comment Edited] (HDDS-11368) Remove babel dependencies from Recon

2024-08-26 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876794#comment-17876794
 ] 

Tsz-wo Sze edited comment on HDDS-11368 at 8/26/24 5:16 PM:


Setting this as blocker of 1.4.1, 1.5.0 and 2.0.0


was (Author: szetszwo):
Setting this as blocker of 1.4.1 and 2.0.0

> Remove babel dependencies from Recon
> 
>
> Key: HDDS-11368
> URL: https://issues.apache.org/jira/browse/HDDS-11368
> Project: Apache Ozone
>  Issue Type: Task
>  Components: Ozone Recon
>Affects Versions: 1.4.0
>Reporter: Abhishek Pal
>Assignee: Abhishek Pal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.0, 1.4.1, 2.0.0
>
>
> *caniuse-lite* is currently being imported as a part of babel, which is 
> internally used by vitejs/plugin-react.
> Since the library (caniuse-lite) is licensed under *CC-by-4.0* it cannot be 
> used in our projects.
> This JIRA is to track the removal of the dependency from Recon



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Resolved] (RATIS-2143) Off-heap memory oom issue in SegmentedRaftLogReader

2024-08-23 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2143.
---
Resolution: Invalid

[~weiming], thanks for checking it!

Resolving ...
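
For anyone hitting the error quoted below: direct ByteBuffer allocations 
(including the temporary buffers NIO creates internally for channel reads) are 
capped by -XX:MaxDirectMemorySize. A minimal standalone sketch (not taken from 
this issue) that reproduces the same class of failure:
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with, e.g., -XX:MaxDirectMemorySize=64m; once the limit is reached the
// JVM throws "OutOfMemoryError: Cannot reserve ... bytes of direct buffer memory".
public class DirectMemoryLimitDemo {
  public static void main(String[] args) {
    final List<ByteBuffer> buffers = new ArrayList<>();
    while (true) {
      // Each allocateDirect call is charged against MaxDirectMemorySize.
      buffers.add(ByteBuffer.allocateDirect(8192));
    }
  }
}
{code}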

> Off-heap memory oom issue in SegmentedRaftLogReader
> ---
>
> Key: RATIS-2143
> URL: https://issues.apache.org/jira/browse/RATIS-2143
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 3.0.1
>Reporter: weiming
>Priority: Major
> Attachments: image-2024-08-21-15-17-45-705.png, 
> image-2024-08-21-15-41-00-261.png, image-2024-08-22-11-26-01-822.png
>
>
> In our Ozone cluster, the SCM page showed a DN in the DEAD state. When 
> restarting, the DN could not start normally, and an off-heap memory OOM was 
> found in the log.
>  
> ENV:
> ratis version release-3.0.1
>  
> JDK:
> openjdk 17.0.2 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2+8-86)
> OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
>  
> Ozone DN JVM param:
> {code:java}
> // code placeholder
> export OZONE_DATANODE_OPTS="-Xms24g -Xmx48g -Xmn16g -XX:MetaspaceSize=512m 
> -XX:MaxDirectMemorySize=48g -XX:+UseG1GC -XX:MaxGCPauseMillis=60 
> -XX:ParallelGCThreads=32 -XX:ConcGCThreads=16 -XX:+AlwaysPreTouc
> h -XX:+TieredCompilation -XX:+UseStringDeduplication 
> -XX:+OptimizeStringConcat -XX:G1HeapRegionSize=32M 
> -XX:+ParallelRefProcEnabled -XX:ReservedCodeCacheSize=1024M 
> -XX:+UnlockExperimentalVMOptions -XX:G1M
> ixedGCLiveThresholdPercent=85 -XX:G1HeapWastePercent=10 
> -XX:InitiatingHeapOccupancyPercent=40 -XX:-G1UseAdaptiveIHOP -verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExitOnOutOfMemoryError -Dorg.apache.r
> atis.thirdparty.io.netty.tryReflectionSetAccessible=true 
> -Xlog:gc*=info:file=${OZONE_LOG_DIR}/dn_gc-%p.log:time,level,tags:filecount=50,filesize=100M
>  -XX:NativeMemoryTracking=detail " {code}
>  
> ERROR LOG:
>  
> java.lang.OutOfMemoryError: Cannot reserve 8192 bytes of direct buffer memory 
> (allocated: 51539599490, limit: 51539607552)
> at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:121)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:332)
> at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:243)
> at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:293)
> at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:273)
> at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:232)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101)
> at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
> at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:343)
> at java.base/java.io.FilterInputStream.read(FilterInputStream.java:132)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader$LimitedInputStream.read(SegmentedRaftLogReader.java:96)
> at java.base/java.io.DataInputStream.read(DataInputStream.java:151)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.verifyHeader(SegmentedRaftLogReader.java:172)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.init(SegmentedRaftLogInputStream.java:95)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:122)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:131)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:236)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:346)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:295)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:236)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:186)
> at java.base/java.lang.Thread.run(Thread.java:833)
> !image-2024-08-21-15-17-45-705.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HDDS-11360) NPE in OMRatisHelper

2024-08-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876326#comment-17876326
 ] 

Tsz-wo Sze commented on HDDS-11360:
---

[~jianghuazhu], we should find out why reply.getMessage() is null.
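
For illustration only, a hypothetical guard (assumed names, not the fix being 
suggested here) showing how the call site could fail with a clearer error 
while the root cause of the null message is investigated:
{code:java}
import java.io.IOException;

import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMResponse;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

public final class RaftReplyGuard {
  private RaftReplyGuard() { }

  // Hypothetical sketch mirroring the pattern in
  // OMRatisHelper.getOMResponseFromRaftClientReply, but reporting the reply
  // state instead of throwing a bare NullPointerException.
  public static OMResponse toOMResponse(RaftClientReply reply) throws IOException {
    final Message message = reply.getMessage();
    if (message == null) {
      throw new IOException("RaftClientReply has no message: success="
          + reply.isSuccess() + ", exception=" + reply.getException());
    }
    return OMResponse.parseFrom(message.getContent().toByteArray());
  }
}
{code}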

> NPE in OMRatisHelper
> 
>
> Key: HDDS-11360
> URL: https://issues.apache.org/jira/browse/HDDS-11360
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: OM
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Priority: Major
>
> Found some NullPointerException in OMRatisHelper. Here are some cases.
> om ha switch:
> {code:java}
> 2024-08-23 13:41:35,785 [om22-server-thread545] INFO 
> org.apache.ratis.server.RaftServer$Division: om22@group-61C56C563FC9: receive 
> transferLeadership 
> TransferLeadershipRequest:client-CBC5546B4108->om22@group-61C56C563FC9, 
> cid=3, seq=null, RO, null
> 2024-08-23 13:41:35,786 [om22-server-thread545] INFO 
> org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: 
> start transferring leadership to om21
> 2024-08-23 13:41:35,787 [om22-server-thread545] INFO 
> org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: 
> sendStartLeaderElection to follower om21, lastEntry=(t:77, i:12154700362)
> 2024-08-23 13:41:35,787 [om22-server-thread545] INFO 
> org.apache.ratis.server.impl.TransferLeadership: om22@group-61C56C563FC9: 
> SUCCESS sent StartLeaderElection to transferee om21 immediately as it already 
> has up-to-date log
> {code}
> OMRatisHelper:
> {code:java}
> 2024-08-23 13:41:35,869 [IPC Server handler 113 on default port 9862] WARN 
> org.apache.hadoop.ipc.Server: IPC Server handler 113 on default port 9862, 
> call Call#8836 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> xxx.xxx.xxx.xxx:33796 / xx.xx.xx.xx:33796
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.ratis.protocol.Message.getContent()" because the return value of 
> "org.apache.ratis.protocol.RaftClientReply.getMessage()" is null
>   at 
> org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:66)
>   at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:524)
>   at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$1(OzoneManagerRatisServer.java:279)
>   at 
> org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
>   at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:277)
>   at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:257)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:257)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:236)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:172)
>   at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:163)
>   at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1098)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1021)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
> {code}
> s3gateway log:
> {code:java}
> 2024-08-23 13:41:35,801 [qtp1396431506-4981] INFO 
> org.apache.hadoop.io.retry.RetryInvocationHandler: 
> com.google.protobuf.ServiceException: 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): Cannot 
> invoke "org.apache.ratis.protocol.Message.getCon

[jira] [Resolved] (RATIS-2113) Use consistent method names and parameter types in RaftUtils

2024-08-22 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2113.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.
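
As a rough sketch of the intended direction for the items in the description 
below (hypothetical shape, not the actual RaftUtils API): parallel get* names 
and a single Consumer<String> output type instead of mixing PrintStream and 
Consumer.
{code:java}
import java.io.PrintStream;
import java.util.List;
import java.util.function.Consumer;

import org.apache.ratis.protocol.RaftPeer;
import org.apache.ratis.protocol.RaftPeerId;

// Hypothetical API shape only, illustrating the consistency asked for below.
public interface ShellPeerUtils {
  RaftPeerId getPeerId(String host, int port);

  List<RaftPeer> getPeers(String commaSeparatedAddresses);

  // Adapt legacy PrintStream callers once, in one place.
  static Consumer<String> asConsumer(PrintStream out) {
    return out::println;
  }
}
{code}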

>  Use consistent method names and parameter types in RaftUtils
> -
>
> Key: RATIS-2113
> URL: https://issues.apache.org/jira/browse/RATIS-2113
> Project: Ratis
>  Issue Type: Improvement
>  Components: shell
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Since RaftUtils is going to be a public API, we should: 
> - Use consistent method names: getPeerId vs buildRaftPeersFromStr.
> - Use consistent parameter types: PrintStream vs Consumer
> - Remove duplicated AbstractCommand.parseInetSocketAddress



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2140) Thread wait when installing snapshot

2024-08-22 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2140.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~z-bb]!

> Thread wait when installing snapshot
> 
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notifies 
> the follower of InstallSnapshot, timing issues may leave the leader's 
> GrpcAppender thread in the wait state, causing the snapshot installation to 
> fail; the follower then does not receive the leader's heartbeat within the 
> specified timeout period and triggers an election.
> The last log that triggered the exception
> node1:
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership 
> ...{code}
> node2:
> {code:java}
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
> appendEntries as snapshot (17159569079) installation is in progress
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: receive installSnapshot: 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: reply installSnapshot: 
> node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, 
> i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
> node2@group-4F53D3317400-FollowerState
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> changes role from  FOLLOWER to CANDIDATE at term 139 for chan

[jira] [Assigned] (RATIS-2140) Thread wait when installing snapshot

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2140:
-

Assignee: guangbao zhao

> Thread wait when installing snapshot
> 
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Assignee: guangbao zhao
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notifies the 
> follower to install a snapshot, timing issues may leave the leader's 
> GrpcLogAppender thread stuck in the wait state, so the snapshot installation 
> fails and the follower, not receiving the leader's heartbeat within the 
> configured timeout, triggers an election.
> The last log lines before the exception:
> node1:
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership 
> ...{code}
> node2:
> {code:java}
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
> appendEntries as snapshot (17159569079) installation is in progress
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: receive installSnapshot: 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: reply installSnapshot: 
> node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, 
> i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
> node2@group-4F53D3317400-FollowerState
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> changes role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
> ...{code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12 
> 

[jira] [Resolved] (RATIS-2144) SegmentedRaftLogWorker should close the stream before releasing the buffer.

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2144.
---
Fix Version/s: 3.2.0
 Assignee: Xinyu Tan
   Resolution: Fixed

The pull request is now merged.  [~tanxinyu], thanks for the quick fix!

> SegmentedRaftLogWorker should close the stream before releasing the buffer.
> ---
>
> Key: RATIS-2144
> URL: https://issues.apache.org/jira/browse/RATIS-2144
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Tsz-wo Sze
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.2.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the code below, close() frees the buffer first and only then cleans up out. The 
> buffer content can be corrupted and then flushed to out.
> {code}
> //SegmentedRaftLogWorker
>   void close() {
> ...
> PlatformDependent.freeDirectBuffer(writeBuffer);
> IOUtils.cleanup(LOG, out);
> LOG.info("{} close()", name);
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2140) Thread wait when installing snapshot

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2140:
--
Component/s: gRPC

> Thread wait when installing snapshot
> 
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notifies the 
> follower to install a snapshot, timing issues may leave the leader's 
> GrpcLogAppender thread stuck in the wait state, so the snapshot installation 
> fails and the follower, not receiving the leader's heartbeat within the 
> configured timeout, triggers an election.
> The last log lines before the exception:
> node1:
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership 
> ...{code}
> node2:
> {code:java}
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
> appendEntries as snapshot (17159569079) installation is in progress
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: receive installSnapshot: 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: reply installSnapshot: 
> node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, 
> i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
> node2@group-4F53D3317400-FollowerState
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> changes role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
> ...{code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@grou

[jira] [Commented] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875681#comment-17875681
 ] 

Tsz-wo Sze commented on HDDS-11352:
---

Filed RATIS-2144.

> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> --
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
>  Issue Type: Sub-task
>  Components: Ozone Manager
>Reporter: Ethan Rose
>Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this 
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] 
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be 
> specific to that test in particular.
> {code}
> ---
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> ---
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s 
> <<< FAILURE! - in 
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown  Time 
> elapsed: 18.461 s  <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> omNode-1@group-523986131536: Failed to initRaftLog.
>   at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>   at 
> java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
>   at 
> java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
>   at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: 
> Failed to initRaftLog.
>   at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171)
>   at 
> org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131)
>   at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63)
>   at 
> org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
>   ... 4 more
> Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry 
> corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:172)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:428)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:258)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:231)
>   at 
> org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:273)
>   at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:194)
>   at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:168)
>   ... 9 more
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.testListVolumes 
>  Time elapsed: 121.075 s  <<< ERROR!
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
 ] 

Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 11:00 PM:
-

{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] 
INFO  segmented.SegmentedRaftLogWorker 
(SegmentedRaftLogWorker.java:execute(637)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment 
/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  
segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() 
in two different threads at about the same time.

Checking the code below: it frees the buffer first and only then cleans up out, 
so the buffer content can be corrupted and then flushed to out.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}
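
For reference, a minimal sketch of the reordered close() (assuming out and writeBuffer are the worker's stream and direct buffer as above; this is a sketch, not necessarily the exact patch): close the stream first so any pending flush still sees a valid buffer, and free the direct buffer only afterwards.
{code:java}
//SegmentedRaftLogWorker -- sketch only, not the committed fix
  void close() {
    // ... stop the worker and drain pending tasks ...
    IOUtils.cleanup(LOG, out);                        // flush/close the stream while the buffer is still valid
    PlatformDependent.freeDirectBuffer(writeBuffer);  // release the direct buffer only after the stream is closed
    LOG.info("{} close()", name);
  }
{code}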



was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] 
INFO  segmented.SegmentedRaftLogWorker 
(SegmentedRaftLogWorker.java:execute(637)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment 
/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  
segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() 
in two different threads at about the same time.

Checking the code below: it frees the buffer first and only then cleans up out, 
so the buffer content can be corrupted and then flushed to out.  It is a recent 
change by RATIS-2065.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}


> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> --
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
>  Issue Type: Sub-task
>  Components: Ozone Manager
>Reporter: Ethan Rose
>Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this 
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] 
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be 
> specific to that test in particular.
> {code}
> ---
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> ---
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s 
> <<< FAILURE! - in 
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown  Time 
> elapsed: 18.461 s  <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> omNode-1@group-523986131536: Failed to initRaftLog.
>   at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>   at 
> java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
>   at 
> java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
>   at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Threa

[jira] [Commented] (RATIS-2065) Avoid the out-of-heap memory OOM phenomenon of frequent creation and deletion of Raft group scenarios

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875680#comment-17875680
 ] 

Tsz-wo Sze commented on RATIS-2065:
---

[~tanxinyu], it looks like there is a bug; see RATIS-2144.

> Avoid the out-of-heap memory OOM phenomenon of frequent creation and deletion 
> of Raft group scenarios
> -
>
> Key: RATIS-2065
> URL: https://issues.apache.org/jira/browse/RATIS-2065
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Xinyu Tan
>Assignee: Xinyu Tan
>Priority: Major
> Fix For: 3.1.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current SegmentedRaftLogWorker allocates a 
> [DirectBuffer|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L209]
>  when it is created but never releases it explicitly. The off-heap memory can 
> only be freed after the GC reclaims the corresponding in-heap object.
> In scenarios with frequent Raft group creation and deletion, the heap may never 
> fill up enough to trigger a GC, while the off-heap memory stays occupied by these 
> abandoned DirectBuffers, eventually leading to an off-heap OOM.
> In IoTDB, we 
> [explicitly|https://github.com/apache/iotdb/blob/master/iotdb-core/datanode/src/main/java/org/apache/iotdb/db/utils/MmapUtil.java#L33]
>  release the off-heap memory, which avoids a similar situation.
>  
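
To illustrate the explicit-release idea, a minimal sketch (the shaded Netty package path and the helper class are assumptions, not the committed change): free the worker's direct buffer when its owner is closed instead of waiting for GC.
{code:java}
import java.nio.ByteBuffer;
import org.apache.ratis.thirdparty.io.netty.util.internal.PlatformDependent;

// Sketch: explicitly release a direct buffer when its owner is closed.
final class DirectBufferRelease {
  static void release(ByteBuffer writeBuffer) {
    if (writeBuffer != null && writeBuffer.isDirect()) {
      PlatformDependent.freeDirectBuffer(writeBuffer);  // frees the off-heap memory immediately
    }
  }
}
{code}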



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (RATIS-2144) SegmentedRaftLogWorker should close the stream before releasing the buffer.

2024-08-21 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-2144:
-

 Summary: SegmentedRaftLogWorker should close the stream before 
releasing the buffer.
 Key: RATIS-2144
 URL: https://issues.apache.org/jira/browse/RATIS-2144
 Project: Ratis
  Issue Type: Bug
  Components: server
Reporter: Tsz-wo Sze


In the code below, close() frees the buffer first and only then cleans up out. The 
buffer content can be corrupted and then flushed to out.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
 ] 

Tsz-wo Sze edited comment on HDDS-11352 at 8/21/24 10:49 PM:
-

{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] 
INFO  segmented.SegmentedRaftLogWorker 
(SegmentedRaftLogWorker.java:execute(637)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment 
/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  
segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() 
in two different threads at about the same time.

Checking the code below: it frees the buffer first and only then cleans up out, 
so the buffer content can be corrupted and then flushed to out.  It is a recent 
change by RATIS-2065.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}



was (Author: szetszwo):
{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] 
INFO  segmented.SegmentedRaftLogWorker 
(SegmentedRaftLogWorker.java:execute(637)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment 
/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  
segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() 
in two different threads at about the same time.

Checking the code below: it frees the buffer first and then calls close(), so the 
buffer content can be corrupted.  It is a recent change by RATIS-2065.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}


> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> --
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
>  Issue Type: Sub-task
>  Components: Ozone Manager
>Reporter: Ethan Rose
>Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this 
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] 
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be 
> specific to that test in particular.
> {code}
> ---
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> ---
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s 
> <<< FAILURE! - in 
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown  Time 
> elapsed: 18.461 s  <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> omNode-1@group-523986131536: Failed to initRaftLog.
>   at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>   at 
> java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
>   at 
> java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
>   at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.la

[jira] [Commented] (HDDS-11352) Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875678#comment-17875678
 ] 

Tsz-wo Sze commented on HDDS-11352:
---

{code}
2024-08-21 07:00:19,708 [omNode-1@group-523986131536-SegmentedRaftLogWorker] 
INFO  segmented.SegmentedRaftLogWorker 
(SegmentedRaftLogWorker.java:execute(637)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker: created new log segment 
/home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-b7f92b3c-3189-4adb-a2d3-737d6c7b9dca/omNode-1/ratis/c9bc4cf4-3bc3-3c60-a66b-523986131536/current/log_inprogress_107
2024-08-21 07:00:19,709 [omNode-1-impl-thread1] INFO  
segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 
omNode-1@group-523986131536-SegmentedRaftLogWorker close()
{code}
In the log, SegmentedRaftLogWorker created a new log segment and called close() 
in two different threads at about the same time.

Checking the code below: it frees the buffer first and then calls close(), so the 
buffer content can be corrupted.  It is a recent change by RATIS-2065.
{code}
//SegmentedRaftLogWorker
  void close() {
...
PlatformDependent.freeDirectBuffer(writeBuffer);
IOUtils.cleanup(LOG, out);
LOG.info("{} close()", name);
  }
{code}


> Intermittent Raft Log Corruption in TestOzoneManagerHAWithStoppedNodes
> --
>
> Key: HDDS-11352
> URL: https://issues.apache.org/jira/browse/HDDS-11352
> Project: Apache Ozone
>  Issue Type: Sub-task
>  Components: Ozone Manager
>Reporter: Ethan Rose
>Priority: Critical
> Attachments: it-om.zip
>
>
> Failure observed in [this 
> run|https://github.com/apache/ozone/actions/runs/10484629833/job/29039668567] 
> in {{TestOzoneManagerHAWithStoppedNodes#testListVolumes}}, but may not be 
> specific to that test in particular.
> {code}
> ---
> Test set: org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> ---
> Tests run: 12, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 621.712 s 
> <<< FAILURE! - in 
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes
> org.apache.hadoop.ozone.om.TestOzoneManagerHAWithStoppedNodes.twoOMDown  Time 
> elapsed: 18.461 s  <<< ERROR!
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> omNode-1@group-523986131536: Failed to initRaftLog.
>   at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>   at 
> java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1498)
>   at 
> java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)
>   at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>   at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:206)
>   at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.IllegalStateException: omNode-1@group-523986131536: 
> Failed to initRaftLog.
>   at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:171)
>   at 
> org.apache.ratis.server.impl.ServerState.lambda$new$6(ServerState.java:131)
>   at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:63)
>   at 
> org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:148)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:385)
>   at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
>   ... 4 more
> Caused by: org.apache.ratis.protocol.exceptions.ChecksumException: Log entry 
> corrupted: Calculated checksum is 3AB532B2 but read checksum is 31120F6C.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:319)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:204)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:131)
>   at 
> or

[jira] [Updated] (RATIS-2142) OOM for stateMachineCache use cases

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2142:
--
Summary: OOM for stateMachineCache use cases  (was: Memory leak for 
stateMachineCache use cases)

> OOM for stateMachineCache use cases
> ---
>
> Key: RATIS-2142
> URL: https://issues.apache.org/jira/browse/RATIS-2142
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Priority: Major
>
> In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a 
> reference to the original RaftClientRequest. This is not supposed to happen 
> as RaftLogCache entries should only refer to the LogEntries with data 
> truncated. 
> This problem impacts Apache Ozone. The reference from RaftLogCache entries 
> prevents the original RaftClientRequest (which contains a large data chunk) 
> from being GCed promptly. As a result, Ozone datanodes quickly run out of heap memory.
> This is not the case with the latest master branch, only with the 3.1.0 
> release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (RATIS-2142) OOM for stateMachineCache use cases

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2142.
---
Resolution: Duplicate

Resolving this as a duplicate of RATIS-2141.

> OOM for stateMachineCache use cases
> ---
>
> Key: RATIS-2142
> URL: https://issues.apache.org/jira/browse/RATIS-2142
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Priority: Major
>
> In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a 
> reference to the original RaftClientRequest. This is not supposed to happen 
> as RaftLogCache entries should only refer to the LogEntries with data 
> truncated. 
> This problem impacts Apache Ozone. The reference from RaftLogCache entries 
> prevents the original RaftClientRequest (which contains a large data chunk) 
> from being GCed promptly. As a result, Ozone datanodes quickly run out of heap memory.
> This is not the case with the latest master branch, only with the 3.1.0 
> release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (RATIS-2141) OOM for stateMachineCache use cases

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2141:
--
Summary: OOM for stateMachineCache use cases  (was: Memory leak for 
stateMachineCache use cases)

[~duongnguyen], thanks for finding this problem!

"Memory leak" usually means that memory was allocated but not released; see
- https://en.wikipedia.org/wiki/Memory_leak

In this case, we do not have such a problem.  Our problem is unnecessarily 
using too much memory.

Updating the Summary.
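
To make the distinction concrete, an illustrative sketch (not Ratis code; the field names are made up): nothing is leaked in the classic sense, but a cache entry that keeps a strong reference to the original request retains its large payload until the entry itself is evicted.
{code:java}
// Illustrative only -- this is not the RaftLogCache implementation.
final class CachedLogEntry {
  final long index;
  final byte[] truncatedEntry;   // what the cache is meant to hold (small)
  final Object originalRequest;  // accidental strong reference keeping the large payload reachable

  CachedLogEntry(long index, byte[] truncatedEntry, Object originalRequest) {
    this.index = index;
    this.truncatedEntry = truncatedEntry;
    // Dropping this reference would let the request (and its data chunk) be GCed promptly.
    this.originalRequest = originalRequest;
  }
}
{code}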

> OOM for stateMachineCache use cases
> ---
>
> Key: RATIS-2141
> URL: https://issues.apache.org/jira/browse/RATIS-2141
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Priority: Major
> Attachments: RaftLogCache_entry.png, heap-dump.png
>
>
> In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a 
> reference to the original RaftClientRequest. This is not supposed to happen 
> as RaftLogCache entries should only refer to the LogEntries with data 
> truncated, and RaftLogCache retention policy only counts the size of the 
> entries without data.
> This problem impacts Apache Ozone. The reference from RaftLogCache entries 
> prevents the original RaftClientRequest (which contains a large data chunk) 
> from being GCed. As a result, Ozone datanodes quickly run out of heap memory.
> !heap-dump.png|width=1286,height=141!
> !RaftLogCache_entry.png|width=730,height=272!
> This is not the case with the latest master branch, only with the 3.1.0 
> release.
> The fix for this issue in 3.1.0 is as simple as 
> [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (RATIS-2141) Memory leak for stateMachineCache use cases

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875564#comment-17875564
 ] 

Tsz-wo Sze edited comment on RATIS-2141 at 8/21/24 4:25 PM:


Let's revert RATIS-1983 from 3.1.0.  I have just tried the revert; it only 
has some minor conflicts.


was (Author: szetszwo):
Let's revert RATIS-1983.  I have just tried the revert; it only has some 
minor conflicts.

> Memory leak for stateMachineCache use cases
> ---
>
> Key: RATIS-2141
> URL: https://issues.apache.org/jira/browse/RATIS-2141
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Priority: Major
> Attachments: RaftLogCache_entry.png, heap-dump.png
>
>
> In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a 
> reference to the original RaftClientRequest. This is not supposed to happen 
> as RaftLogCache entries should only refer to the LogEntries with data 
> truncated, and RaftLogCache retention policy only counts the size of the 
> entries without data.
> This problem impacts Apache Ozone. The reference from RaftLogCache entries 
> prevents the original RaftClientRequest (which contains a large data chunk) 
> from being GCed. As a result, Ozone datanodes quickly run out of heap memory.
> !heap-dump.png|width=1286,height=141!
> !RaftLogCache_entry.png|width=730,height=272!
> This is not the case with the latest master branch, only with the 3.1.0 
> release.
> The fix for this issue in 3.1.0 is as simple as 
> [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2141) Memory leak for stateMachineCache use cases

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875564#comment-17875564
 ] 

Tsz-wo Sze commented on RATIS-2141:
---

Let's revert RATIS-1983.  I have just tried the revert; it only has some 
minor conflicts.

> Memory leak for stateMachineCache use cases
> ---
>
> Key: RATIS-2141
> URL: https://issues.apache.org/jira/browse/RATIS-2141
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.1.0
>Reporter: Duong
>Priority: Major
> Attachments: RaftLogCache_entry.png, heap-dump.png
>
>
> In 3.1.0, with stateMachineCache enabled, the RaftLogCache entries contain a 
> reference to the original RaftClientRequest. This is not supposed to happen 
> as RaftLogCache entries should only refer to the LogEntries with data 
> truncated, and RaftLogCache retention policy only counts the size of the 
> entries without data.
> This problem impacts Apache Ozone. The reference from RaftLogCache entries 
> prevents the original RaftClientRequest (which contains a large data chunk) 
> from being GCed. As a result, Ozone datanodes quickly run out of heap memory.
> !heap-dump.png|width=1286,height=141!
> !RaftLogCache_entry.png|width=730,height=272!
> This is not the case with the latest master branch, only with the 3.1.0 
> release.
> The fix for this issue in 3.1.0 is as simple as 
> [6a141544c567a6325b05e2972cd426cdc14060cb|https://github.com/duongkame/ratis/commit/bcff74af0a5fa4b68af2267ce8dfa01f65a5445c].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RATIS-2143) Off-heap memory oom issue in SegmentedRaftLogReader

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875562#comment-17875562
 ] 

Tsz-wo Sze commented on RATIS-2143:
---

How about the number of pipelines?  We have seen cases where there were hundreds 
of uncleaned-up pipelines in a datanode, which caused an OOM.  See Duong's 
reply in https://lists.apache.org/thread/dpo6tjmxy1n9gmc67jjjm7pon8txfyjb
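
(For reference, the limit in the error below, 51,539,607,552 bytes, is exactly 48 GiB, matching the configured -XX:MaxDirectMemorySize=48g.)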

> Off-heap memory oom issue in SegmentedRaftLogReader
> ---
>
> Key: RATIS-2143
> URL: https://issues.apache.org/jira/browse/RATIS-2143
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 3.0.1
>Reporter: weiming
>Priority: Major
> Attachments: image-2024-08-21-15-17-45-705.png, 
> image-2024-08-21-15-41-00-261.png, image-2024-08-21-15-43-24-729.png
>
>
> In our Ozone cluster, a DN was shown as DEAD on the SCM page. When restarting, 
> the DN could not start normally, and an off-heap memory OOM was found in the log.
>  
> ENV:
> ratis version release-3.0.1
>  
> JDK:
> openjdk 17.0.2 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2+8-86)
> OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
>  
> Ozone DN JVM param:
> {code:java}
> // code placeholder
> export OZONE_DATANODE_OPTS="-Xms24g -Xmx48g -Xmn16g -XX:MetaspaceSize=512m 
> -XX:MaxDirectMemorySize=48g -XX:+UseG1GC -XX:MaxGCPauseMillis=60 
> -XX:ParallelGCThreads=32 -XX:ConcGCThreads=16 -XX:+AlwaysPreTouc
> h -XX:+TieredCompilation -XX:+UseStringDeduplication 
> -XX:+OptimizeStringConcat -XX:G1HeapRegionSize=32M 
> -XX:+ParallelRefProcEnabled -XX:ReservedCodeCacheSize=1024M 
> -XX:+UnlockExperimentalVMOptions -XX:G1M
> ixedGCLiveThresholdPercent=85 -XX:G1HeapWastePercent=10 
> -XX:InitiatingHeapOccupancyPercent=40 -XX:-G1UseAdaptiveIHOP -verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExitOnOutOfMemoryError -Dorg.apache.r
> atis.thirdparty.io.netty.tryReflectionSetAccessible=true 
> -Xlog:gc*=info:file=${OZONE_LOG_DIR}/dn_gc-%p.log:time,level,tags:filecount=50,filesize=100M
>  -XX:NativeMemoryTracking=detail " {code}
>  
> ERROR LOG:
>  
> java.lang.OutOfMemoryError: Cannot reserve 8192 bytes of direct buffer memory 
> (allocated: 51539599490, limit: 51539607552)
> at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:121)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:332)
> at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:243)
> at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:293)
> at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:273)
> at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:232)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107)
> at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101)
> at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
> at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:343)
> at java.base/java.io.FilterInputStream.read(FilterInputStream.java:132)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader$LimitedInputStream.read(SegmentedRaftLogReader.java:96)
> at java.base/java.io.DataInputStream.read(DataInputStream.java:151)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.verifyHeader(SegmentedRaftLogReader.java:172)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.init(SegmentedRaftLogInputStream.java:95)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:122)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:131)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:236)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:346)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:295)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:236)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:186)
> at java.base/java.lang.Thread.run(Thread.java:833)
> !image-2024-08-21-15-17-45-705.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (RATIS-2116) Follower state synchronization is blocked

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875551#comment-17875551
 ] 

Tsz-wo Sze edited comment on RATIS-2116 at 8/21/24 4:16 PM:


bq. Curiosity about the Ratis community version policy, where can I find the 
currently supported feature versions? 

You may search for the "Fix Version/s" field to find the JIRAs, or use git to 
generate the diff between two releases.

bq. will such a bug fix be backported to lower versions?

If there is a need, we can definitely backport bug fixes.

If you are interested in a bug-fix release for an older version, please feel 
free to let us know.


was (Author: szetszwo):
> Curiosity about the Ratis community version policy, where can I find the 
> currently supported feature versions? 

You may search for the "Fix Version/s" field to find the JIRAs, or use git to 
generate the diff between two releases.

> will such a bug fix be backported to lower versions?

If there is a need, we can definitely backport bug fixes.

> Follower state synchronization is blocked
> -
>
> Key: RATIS-2116
> URL: https://issues.apache.org/jira/browse/RATIS-2116
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.0.0, 2.5.1, 3.0.1
>Reporter: Haibo Sun
>Assignee: Haibo Sun
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: debug.log
>
>
> Using version 2.5.1, we have discovered that in some cases, the state 
> synchronization of the follower will be permanently blocked.
> Scenario: When the task queue of the SegmentedRaftLogWorker is the pattern 
> (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog of 
> RaftServerImpl.appendEntries does not immediately flush data and complete the 
> result future, because there is a pending PurgeLog task in the queue. It 
> enqueues the result future to be completed after the latter WriteLog flushes 
> data. However, the "nioEventLoopGroup-3-1" thread is already blocked, and 
> will not add new WriteLog to the task queue of SegmentedRaftLogWorker. This 
> leads to a deadlock and causes the state synchronization to stop.
> I confirmed this by adding debug logs; detailed information is attached 
> below. This issue can be easily reproduced by increasing the frequency of 
> TakeSnapshot and PurgeLog operations. In addition, after checking the code in 
> the master branch, this issue still exists.
>  
> *jstack:*
> {code:java}
> "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x7fc58400b800 
> nid=0x5493a waiting on condition [0x7fc5b4f28000] java.lang.Thread.State: 
> WAITING (parking) at sun.misc.Unsafe.park0(Native Method) parking to wait for 
> <0x7fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller) at 
> sun.misc.Unsafe.park(Unsafe.java:1025) at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:176) at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>  at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>  at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934) 
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
>  at 
> org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
>  at 
> org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
>  at 
> org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
>  at 
> org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  at 
> org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext

[jira] [Commented] (RATIS-2140) Thread wait when installing snapshot

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875559#comment-17875559
 ] 

Tsz-wo Sze commented on RATIS-2140:
---

We probably should always add a timeout when calling await().
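
A generic sketch of what that could look like (the latch and the timeout value below are illustrative, not the actual GrpcLogAppender fields): bound the wait so a lost reply cannot park the appender thread forever.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only.
class AwaitWithTimeoutExample {
  private final CountDownLatch replyReceived = new CountDownLatch(1);

  boolean waitForReply() throws InterruptedException {
    // Returns false on timeout so the caller can retry or give up instead of waiting forever.
    return replyReceived.await(10, TimeUnit.SECONDS);
  }

  void onReply() {
    replyReceived.countDown();
  }
}
{code}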

> Thread wait when installing snapshot
> 
>
> Key: RATIS-2140
> URL: https://issues.apache.org/jira/browse/RATIS-2140
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 3.0.1
>Reporter: guangbao zhao
>Priority: Major
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notifies the 
> follower to install a snapshot, timing issues may leave the leader's 
> GrpcLogAppender thread stuck in the wait state, so the snapshot installation 
> fails and the follower, not receiving the leader's heartbeat within the 
> configured timeout, triggers an election.
> The last log lines before the exception:
> node1:
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership 
> ...{code}
> node2:
> {code:java}
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
> appendEntries as snapshot (17159569079) installation is in progress
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: receive installSnapshot: 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: reply installSnapshot: 
> node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, 
> i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
> node2@group-4F53D3317400-FollowerState
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> changes role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
> ...{code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppender

[jira] [Resolved] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2137.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~lemony]!

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Assignee: Kevin Liu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-08-13-11-28-16-250.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it succeeded 
> in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=nul

[jira] [Resolved] (HDDS-11331) Fix Datanode unable to report for a long time

2024-08-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDDS-11331.
---
Fix Version/s: 1.5.0
   Resolution: Fixed

The pull request is now merged.  Thanks, [~jianghuazhu]!

> Fix Datanode unable to report for a long time
> -
>
> Key: HDDS-11331
> URL: https://issues.apache.org/jira/browse/HDDS-11331
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: DN
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.0
>
> Attachments: 1505js.1, 7090_review.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png
>
>
> SCM shows that some Datanodes cannot report for a long time, and their status 
> is DEAD or STALE.
> I printed jstack information, which shows that StateContext#pipelineActions 
> is stuck, so the Datanode cannot report to SCM/Recon.
> The jstack information has been uploaded as an attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (RATIS-2116) Follower state synchronization is blocked

2024-08-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875551#comment-17875551
 ] 

Tsz-wo Sze commented on RATIS-2116:
---

> Out of curiosity about the Ratis community version policy, where can I find the 
> currently supported feature versions? 

You may search for the "Fix Version/s" field to find the related JIRAs, or use git 
to generate the diff between two releases.

> will such a bug fix be backported to lower versions?

If there is a need, we can definitely backport bug fixes.

> Follower state synchronization is blocked
> -
>
> Key: RATIS-2116
> URL: https://issues.apache.org/jira/browse/RATIS-2116
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.0.0, 2.5.1, 3.0.1
>Reporter: Haibo Sun
>Assignee: Haibo Sun
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: debug.log
>
>
> Using version 2.5.1, we have discovered that in some cases, the state 
> synchronization of the follower will be permanently blocked.
> Scenario: When the task queue of the SegmentedRaftLogWorker matches the pattern 
> (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog of 
> RaftServerImpl.appendEntries does not immediately flush data and complete the 
> result future, because there is a pending PurgeLog task in the queue. It 
> enqueues the result future to be completed after a later WriteLog flushes 
> data. However, the "nioEventLoopGroup-3-1" thread is already blocked, and 
> will not add a new WriteLog to the task queue of SegmentedRaftLogWorker. This 
> leads to a deadlock and causes the state synchronization to stop.
> I confirmed this by adding debug logs; detailed information is attached 
> below. This issue can be easily reproduced by increasing the frequency of 
> TakeSnapshot and PurgeLog operations. In addition, I checked the code on 
> the master branch and the issue still exists there.
>  
> *jstack:*
> {code:java}
> "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x7fc58400b800 
> nid=0x5493a waiting on condition [0x7fc5b4f28000] java.lang.Thread.State: 
> WAITING (parking) at sun.misc.Unsafe.park0(Native Method) parking to wait for 
> <0x7fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller) at 
> sun.misc.Unsafe.park(Unsafe.java:1025) at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:176) at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>  at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>  at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934) 
> at 
> org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
>  at 
> org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
>  at 
> org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
>  at 
> org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
>  at 
> org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  at 
> org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  at 
> org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>  at 
> org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelH
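
For illustration, the blocking pattern described above can be reduced to a 
minimal, self-contained sketch: a single worker thread parks in 
CompletableFuture.join() waiting for a result that only a later task on the 
same queue can supply, so that task never gets to run.  The names below are 
hypothetical and deliberately simplified; they are not the actual Ratis classes 
(SegmentedRaftLogWorker queues WriteLog/PurgeLog tasks, not plain lambdas).
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SingleWorkerJoinDeadlock {
  public static void main(String[] args) throws Exception {
    // One worker thread, standing in for the single task queue described above.
    final ExecutorService worker = Executors.newSingleThreadExecutor();
    final CompletableFuture<Void> flushFuture = new CompletableFuture<>();

    // Task 1 parks in join(), occupying the only worker thread.
    final Future<?> blocked = worker.submit(() -> {
      flushFuture.join();
    });

    // Task 2 would complete the future, but it is queued behind Task 1
    // and can never run while the single worker thread is parked.
    worker.submit(() -> {
      flushFuture.complete(null);
    });

    try {
      blocked.get(2, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      System.out.println("deadlocked: the completer is queued behind the waiter");
    }

    // Unblock the worker from outside so the sketch can exit cleanly.
    flushFuture.complete(null);
    worker.shutdown();
  }
}
{code}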

[jira] [Resolved] (HDFS-17606) Do not require implementing CustomizedCallbackHandler

2024-08-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17606.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

The pull request is now merged.

> Do not require implementing CustomizedCallbackHandler
> -
>
> Key: HDFS-17606
> URL: https://issues.apache.org/jira/browse/HDFS-17606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> HDFS-17576 added a CustomizedCallbackHandler interface which declares the 
> following method:
> {code}
>   void handleCallback(List callbacks, String name, char[] password)
>   throws UnsupportedCallbackException, IOException;
> {code}
> This Jira is to allow an implementation to define the handleCallback method 
> without implementing the CustomizedCallbackHandler interface.  It is to avoid 
> having a security provider depend on the HDFS project.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17606) Do not require implementing CustomizedCallbackHandler

2024-08-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17606.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

The pull request is now merged.

> Do not require implementing CustomizedCallbackHandler
> -
>
> Key: HDFS-17606
> URL: https://issues.apache.org/jira/browse/HDFS-17606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> HDFS-17576 added a CustomizedCallbackHandler interface which declares the 
> following method:
> {code}
>   void handleCallback(List callbacks, String name, char[] password)
>   throws UnsupportedCallbackException, IOException;
> {code}
> This Jira is to allow an implementation to define the handleCallback method 
> without implementing the CustomizedCallbackHandler interface.  It is to avoid 
> having a security provider depend on the HDFS project.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-08-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2137:
-

Assignee: Kevin Liu

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Assignee: Kevin Liu
>Priority: Major
> Attachments: image-2024-08-13-11-28-16-250.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it succeeded 
> in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE 

[jira] [Updated] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-08-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-2137:
--
Component/s: server

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Assignee: Kevin Liu
>Priority: Major
> Attachments: image-2024-08-13-11-28-16-250.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it succeeded 
> in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-Lea

[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-08-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874915#comment-17874915
 ] 

Tsz-wo Sze commented on RATIS-2137:
---

[~lemony], it would be great if you could submit a pull request.  Thank you in 
advance!

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Priority: Major
> Attachments: image-2024-08-13-11-28-16-250.png
>
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it succeeded 
> in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733

[jira] [Updated] (HDDS-11331) Fix Datanode unable to report for a long time

2024-08-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-11331:
--
Attachment: 7090_review.patch

> Fix Datanode unable to report for a long time
> -
>
> Key: HDDS-11331
> URL: https://issues.apache.org/jira/browse/HDDS-11331
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: DN
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 1505js.1, 7090_review.patch, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png
>
>
> SCM shows that some Datanodes cannot report for a long time, and their status 
> is DEAD or STALE.
> I printed jstack information, which shows that StateContext#pipelineActions 
> is stuck, so the Datanode cannot report to SCM/Recon.
> The jstack information has been uploaded as an attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Created] (HDFS-17606) Do not require implementing CustomizedCallbackHandler

2024-08-17 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDFS-17606:
-

 Summary: Do not require implementing CustomizedCallbackHandler
 Key: HDFS-17606
 URL: https://issues.apache.org/jira/browse/HDFS-17606
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: security
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


HDFS-17576 added a CustomizedCallbackHandler interface which declares the 
following method:
{code}
  void handleCallback(List callbacks, String name, char[] password)
  throws UnsupportedCallbackException, IOException;
{code}

This Jira is to allow an implementation to define the handleCallback method 
without implementing the CustomizedCallbackHandler interface.  It is to avoid 
having a security provider depend on the HDFS project.
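
For illustration, below is a minimal sketch of what this would allow, assuming 
the handler is located and invoked reflectively; the class and method names are 
hypothetical, not the actual HDFS API.
{code:java}
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical provider-side handler: it declares handleCallback with the
// expected signature but does not implement CustomizedCallbackHandler, so the
// provider needs no compile-time dependency on HDFS.
class ProviderCallbackHandler {
  public void handleCallback(List<?> callbacks, String name, char[] password) {
    // provider-specific verification logic would go here
  }
}

// Sketch of a reflective invocation, as an HDFS-side wrapper could perform.
final class ReflectiveCallbackInvoker {
  private ReflectiveCallbackInvoker() { }

  static void handle(Object handler, List<?> callbacks, String name, char[] password)
      throws Exception {
    final Method m = handler.getClass()
        .getMethod("handleCallback", List.class, String.class, char[].class);
    m.invoke(handler, callbacks, name, password);
  }
}
{code}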



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17606) Do not require implementing CustomizedCallbackHandler

2024-08-17 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDFS-17606:
-

 Summary: Do not require implementing CustomizedCallbackHandler
 Key: HDFS-17606
 URL: https://issues.apache.org/jira/browse/HDFS-17606
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: security
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


HDFS-17576 added a CustomizedCallbackHandler interface which declares the 
following method:
{code}
  void handleCallback(List callbacks, String name, char[] password)
  throws UnsupportedCallbackException, IOException;
{code}

This Jira is to allow an implementation to define the handleCallback method 
without implementing the CustomizedCallbackHandler interface.  It is to avoid 
having a security provider depend on the HDFS project.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (RATIS-2137) Leader fails to send correct index to follower after timeout exception

2024-08-17 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874531#comment-17874531
 ] 

Tsz-wo Sze commented on RATIS-2137:
---

[~lemony], you are right that LogAppenderDefault incorrectly handles the 
INCONSISTENCY case.  It should use the getNextIndexForInconsistency(..) method 
in LogAppenderBase and then setNextIndex(..).
{code}
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java
index 432a4199..f75a80f8 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDefault.java
@@ -23,6 +23,7 @@ import org.apache.ratis.proto.RaftProtos.InstallSnapshotReplyProto;
 import org.apache.ratis.proto.RaftProtos.InstallSnapshotRequestProto;
 import org.apache.ratis.rpc.CallId;
 import org.apache.ratis.server.RaftServer;
+import org.apache.ratis.server.raftlog.RaftLog;
 import org.apache.ratis.server.raftlog.RaftLogIOException;
 import org.apache.ratis.server.util.ServerStringUtils;
 import org.apache.ratis.statemachine.SnapshotInfo;
@@ -34,6 +35,7 @@ import java.io.InterruptedIOException;
 import java.util.Comparator;
 import java.util.UUID;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicLong;
 
 /**
  * The default implementation of {@link LogAppender}
@@ -55,7 +57,7 @@ class LogAppenderDefault extends LogAppenderBase {
   }
 
   /** Send an appendEntries RPC; retry indefinitely. */
-  private AppendEntriesReplyProto sendAppendEntriesWithRetries()
+  private AppendEntriesReplyProto sendAppendEntriesWithRetries(AtomicLong requestFirstIndex)
   throws InterruptedException, InterruptedIOException, RaftLogIOException {
 int retry = 0;
 
@@ -78,9 +80,12 @@ class LogAppenderDefault extends LogAppenderBase {
   return null;
 }
 
-AppendEntriesReplyProto r = sendAppendEntries(request.get());
+final AppendEntriesRequestProto proto = request.get();
+final AppendEntriesReplyProto reply = sendAppendEntries(proto);
+final long first = proto.getEntriesCount() > 0 ? proto.getEntries(0).getIndex() : RaftLog.INVALID_LOG_INDEX;
+requestFirstIndex.set(first);
 request.release();
-return r;
+return reply;
   } catch (InterruptedIOException | RaftLogIOException e) {
 throw e;
   } catch (IOException ioe) {
@@ -164,9 +169,10 @@ class LogAppenderDefault extends LogAppenderBase {
   }
   // otherwise if r is null, retry the snapshot installation
 } else {
-  final AppendEntriesReplyProto r = sendAppendEntriesWithRetries();
+  final AtomicLong requestFirstIndex = new AtomicLong(RaftLog.INVALID_LOG_INDEX);
+  final AppendEntriesReplyProto r = sendAppendEntriesWithRetries(requestFirstIndex);
   if (r != null) {
-handleReply(r);
+handleReply(r, requestFirstIndex.get());
   }
 }
   }
@@ -177,7 +183,8 @@ class LogAppenderDefault extends LogAppenderBase {
 }
   }
 
-  private void handleReply(AppendEntriesReplyProto reply) throws IllegalArgumentException {
+  private void handleReply(AppendEntriesReplyProto reply, long requestFirstIndex)
+  throws IllegalArgumentException {
 if (reply != null) {
   switch (reply.getResult()) {
 case SUCCESS:
@@ -200,7 +207,7 @@ class LogAppenderDefault extends LogAppenderBase {
   onFollowerTerm(reply.getTerm());
   break;
 case INCONSISTENCY:
-  getFollower().setNextIndex(getNextIndexForInconsistency(requestFirstIndex, reply.getNextIndex()));
+  getFollower().setNextIndex(getNextIndexForInconsistency(requestFirstIndex, reply.getNextIndex()));
   break;
 case UNRECOGNIZED:
   LOG.warn("{}: received {}", this, reply.getResult());
{code}

> Leader fails to send correct index to follower after timeout exception
> --
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Kevin Liu
>Priority: Major
> Attachments: image-2024-08-13-11-28-16-250.png
>
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10 minutes, then 
> installSnapshot failed and it started an election. After two hours, it succeeded 
> in installing the snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] Raf

[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time

2024-08-17 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874526#comment-17874526
 ] 

Tsz-wo Sze commented on HDDS-11331:
---

Sure, that seems like a good idea!

> Datanode cannot report for a long time
> --
>
> Key: HDDS-11331
> URL: https://issues.apache.org/jira/browse/HDDS-11331
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: DN
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: 1505js.1, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png
>
>
> SCM shows that some Datanodes cannot report for a long time, and their status 
> is DEAD or STALE.
> I printed jstack information, which shows that StateContext#pipelineActions 
> is stuck, so the Datanode cannot report to SCM/Recon.
> The jstack information has been uploaded as an attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time

2024-08-16 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874353#comment-17874353
 ] 

Tsz-wo Sze commented on HDDS-11331:
---

It seems that this can be fixed by changing pipelineActions to ConcurrentHashMap 
and Collections.synchronizedMap(LinkedHashMap)
{code}
  private final Map> 
pipelineActions = new ConcurrentHashMap<>();
{code}

{code}
  static class Key {
private final HddsProtos.PipelineID pipelineID;
private final PipelineAction.Action action;

Key(HddsProtos.PipelineID pipelineID, PipelineAction.Action action) {
  this.pipelineID = pipelineID;
  this.action = action;
}

@Override
public int hashCode() {
  return Objects.hashCode(pipelineID);
}

@Override
public boolean equals(Object obj) {
  if (this == obj) {
return true;
  } else if (!(obj instanceof Key)) {
return false;
  }
  final Key that = (Key) obj;
  return Objects.equals(this.action, that.action)
  && Objects.equals(this.pipelineID, that.pipelineID);
}
  }
{code}
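
A rough usage sketch of the idea is below (the generics in the snippet above 
were stripped by the mail formatting; the key and value types here are 
simplified placeholders, not the actual StateContext fields):
{code:java}
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class PipelineActionsSketch {
  // pipeline id -> queued actions; a ConcurrentHashMap plus a concurrent queue
  // lets heartbeat threads enqueue while a report thread drains, with no
  // coarse synchronized block for either side to get stuck on.
  private final Map<String, Queue<String>> pipelineActions = new ConcurrentHashMap<>();

  void addAction(String pipelineId, String action) {
    pipelineActions
        .computeIfAbsent(pipelineId, id -> new ConcurrentLinkedQueue<>())
        .add(action);
  }

  Queue<String> drainActions(String pipelineId) {
    // remove() hands the whole queue to the reporter in one atomic step
    return pipelineActions.remove(pipelineId);
  }
}
{code}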

> Datanode cannot report for a long time
> --
>
> Key: HDDS-11331
> URL: https://issues.apache.org/jira/browse/HDDS-11331
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: DN
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: 1505js.1, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png
>
>
> SCM shows that some Datanodes cannot report for a long time, and their status 
> is DEAD or STALE.
> I printed jstack information, which shows that StateContext#pipelineActions 
> is stuck, so the Datanode cannot report to SCM/Recon.
> The jstack information has been uploaded as an attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (HDDS-11291) Datanode Command Handler blocked by executing ratis requests

2024-08-16 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874345#comment-17874345
 ] 

Tsz-wo Sze commented on HDDS-11291:
---

This problem may be similar to HDDS-11331.

> Datanode Command Handler blocked by executing ratis requests
> 
>
> Key: HDDS-11291
> URL: https://issues.apache.org/jira/browse/HDDS-11291
> Project: Apache Ozone
>  Issue Type: Bug
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>
> We met the following issue: the Datanode command handler was executing a close 
> container request, but the timeout logic was not correct, so it blocked all 
> requests from SCM.
> The jstack shows as follows:
> {code:java}
> "Command processor thread" #215 daemon prio=5 os_prio=0 
> tid=0x7fcef3262000 nid=0xa56 waiting on condition [0x7fcf63f9d000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x7fd4ab6dcd38> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>         at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>         at 
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
>         at 
> java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
>         at 
> org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown
>  Source)
>         at java.lang.Thread.run(Thread.java:748) {code}
> The direct reason is that the timeout logic is not working: in Ratis, 
> executeSubmitClientRequestAsync performs a join() operation, so it blocks before 
> the timeout on the outer CompletableFuture can take effect.
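
As an illustration of why the outer timeout cannot take effect, here is a 
minimal, self-contained sketch with hypothetical names (not the actual 
Ozone/Ratis code): the submit call itself parks in join(), so the blocking has 
already happened by the time a timeout can be attached to the returned future.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OuterTimeoutSketch {
  static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  // Simulates an API that blocks internally before returning its future,
  // analogous to the join() inside the submit path described above.
  static CompletableFuture<String> submitBlocking() {
    final CompletableFuture<String> inner = new CompletableFuture<>();
    SCHEDULER.schedule(() -> {
      inner.complete("done");
    }, 5, TimeUnit.SECONDS);
    inner.join();   // the calling thread parks here for the full 5 seconds
    return inner;
  }

  public static void main(String[] args) {
    final long start = System.nanoTime();
    // The 2-second timeout is attached to the returned future, but the caller
    // is already parked inside submitBlocking(), so it never fires.
    final String result = submitBlocking().orTimeout(2, TimeUnit.SECONDS).join();
    System.out.printf("%s after %d seconds%n", result,
        TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start));
    SCHEDULER.shutdown();
  }
}
{code}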



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org



[jira] [Commented] (HDDS-11331) Datanode cannot report for a long time

2024-08-16 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-11331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874293#comment-17874293
 ] 

Tsz-wo Sze commented on HDDS-11331:
---

[~jianghuazhu], why was the thread stuck in calculatePipelineBytesWritten()?

> Datanode cannot report for a long time
> --
>
> Key: HDDS-11331
> URL: https://issues.apache.org/jira/browse/HDDS-11331
> Project: Apache Ozone
>  Issue Type: Improvement
>  Components: DN
>Affects Versions: 1.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: 1505js.1
>
>
> SCM shows that some Datanodes cannot report for a long time, and their status 
> is DEAD or STALE.
> I printed jstack information, which shows that StateContext#pipelineActions 
> is stuck, so the Datanode cannot report to SCM/Recon.
> The jstack information has been uploaded as an attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org


