slfan1989 commented on PR #1264: URL: https://github.com/apache/ratis/pull/1264#issuecomment-2888980787
@szetszwo @adoroszlai I’ve locally fixed the issues in the `TestRaftAsyncWithNetty` unit test. Below is an analysis of the problem and its root causes:

First, the test failures are not related to our upgrade to JUnit5. The tests also fail under JUnit4.

There are two key issues:

- `NettyClientRpc` lacks proper exception handling: When an exception occurs, it is thrown directly, which prevents the retry mechanism from being triggered.
- `NettyClientRpc` does not implement timeout handling: In timeout scenarios, there is no effective control or handling logic in place.

The test results are as follows:

<img width="1839" alt="image" src="https://github.com/user-attachments/assets/ff7f931d-1cbc-46b9-9d80-3168fe043322" />

- Enhanced exception handling logic in `ratis-netty`

After comparing the execution flow of `ratis-grpc`, I found that it handles exceptions properly, which allows the retry mechanism at the `ratis-rpc` layer to be triggered. However, `ratis-netty` lacks this corresponding logic.

This leads to an issue: in the `ratis-netty` unit test (`TestRaftAsyncWithNetty#testStaleReadAsync`), the test fails immediately if `s0` is not the leader. In contrast, `ratis-grpc` handles the `NotLeaderException` properly, which triggers the retry mechanism, allowing the test to pass successfully.

Exception handling logic in `ratis-grpc`:

[ratis/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java](https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L307-L329)

I implemented similar logic in the `NettyClientRpc.java` code.

- Added timeout handling logic in `ratis-netty`

The unit `test(misc)` for [RATIS-2251](https://issues.apache.org/jira/browse/RATIS-2251) frequently times out. The original timeout was set to `60` minutes, and even after increasing it to `90` and `120` minutes, the test still timed out.
This indicates that the issue likely lies within the test itself, rather than in the configured timeout duration.

The root cause is that `ratis-netty` lacks a timeout mechanism, which causes the unit test to hang for extended periods when exceptions occur or there is no response. `ratis-grpc` includes corresponding exception and timeout handling logic, as shown below:

[ratis/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java](https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L367-L371)

I implemented similar logic in the `NettyClientRpc.java` code.
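For readers who want the general shape of the two fixes described above, here is a minimal sketch in plain Java. It is not the code from this PR and does not use any Ratis or Netty API: `AsyncSendSketch`, `send`, `sendWithRetry`, and the `transport` supplier are hypothetical names made up for illustration. It only shows the idea that a failure should complete the reply future exceptionally instead of being thrown to the caller, so that a retry layer can observe it, and that a scheduled timer should fail the future when no reply arrives.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch: none of these names come from Ratis or Netty. */
final class AsyncSendSketch {
  // Daemon thread so the timer never keeps the JVM alive.
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor(r -> {
        final Thread t = new Thread(r, "reply-timeout");
        t.setDaemon(true);
        return t;
      });

  /** Sends once; every failure, including a timeout, is reported via the returned future. */
  static <T> CompletableFuture<T> send(
      Supplier<CompletableFuture<T>> transport, long timeout, TimeUnit unit) {
    final CompletableFuture<T> reply = new CompletableFuture<>();

    // Timeout handling: fail the future if nothing completes it in time.
    final ScheduledFuture<?> timer = SCHEDULER.schedule(() -> {
      reply.completeExceptionally(
          new TimeoutException("no reply within " + timeout + " " + unit));
    }, timeout, unit);
    // Cancel the timer as soon as the future completes for any reason.
    reply.whenComplete((r, e) -> timer.cancel(false));

    try {
      transport.get().whenComplete((r, e) -> {
        if (e != null) {
          reply.completeExceptionally(e);
        } else {
          reply.complete(r);
        }
      });
    } catch (RuntimeException e) {
      // Exception handling: do not rethrow, so the retry layer can react.
      reply.completeExceptionally(e);
    }
    return reply;
  }

  /** Assumed retry policy for illustration: retry on any failure, up to `attempts` sends. */
  static <T> CompletableFuture<T> sendWithRetry(
      Supplier<CompletableFuture<T>> transport, int attempts, long timeout, TimeUnit unit) {
    final CompletableFuture<T> result = new CompletableFuture<>();
    attempt(transport, attempts, timeout, unit, result);
    return result;
  }

  private static <T> void attempt(Supplier<CompletableFuture<T>> transport, int remaining,
      long timeout, TimeUnit unit, CompletableFuture<T> result) {
    send(transport, timeout, unit).whenComplete((r, e) -> {
      if (e == null) {
        result.complete(r);
      } else if (remaining > 1) {
        attempt(transport, remaining - 1, timeout, unit, result);
      } else {
        result.completeExceptionally(e);
      }
    });
  }

  public static void main(String[] args) {
    final AtomicInteger calls = new AtomicInteger();
    // The first call throws (simulating a request sent to a non-leader); the second succeeds.
    final String reply = AsyncSendSketch.<String>sendWithRetry(() -> {
      if (calls.incrementAndGet() == 1) {
        throw new IllegalStateException("simulated: contacted a non-leader");
      }
      return CompletableFuture.completedFuture("ok");
    }, 3, 1, TimeUnit.SECONDS).join();
    System.out.println(reply + " after " + calls.get() + " calls");
  }
}
```

If `send` rethrew the exception instead of completing the future, `sendWithRetry` would never see the failure, which mirrors the first issue listed above. Without the timer, a transport that never answers would leave the caller waiting indefinitely, which mirrors the second.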
