slfan1989 commented on PR #1264:
URL: https://github.com/apache/ratis/pull/1264#issuecomment-2888980787

   @szetszwo @adoroszlai
   
   I’ve locally fixed the issues in the TestRaftAsyncWithNetty unit test. Below 
is an analysis of the problem and its root causes:
   
   First, the test failures are not related to our upgrade to JUnit5. The tests 
also fail under JUnit4.
   
   There are two key issues:
   
   - `NettyClientRpc` lacks proper exception handling: When an exception 
occurs, it is thrown directly, which prevents the retry mechanism from being 
triggered.
   
   - `NettyClientRpc` does not implement timeout handling: In timeout 
scenarios, there is no effective control or handling logic in place.
   
   The test results are as follows:
   
   <img width="1839" alt="image" 
src="https://github.com/user-attachments/assets/ff7f931d-1cbc-46b9-9d80-3168fe043322"
 />
   
   Enhanced exception handling logic in `ratis-netty`
   
   After comparing the execution flow of `ratis-grpc`, I found that it handles 
exceptions properly, which allows the retry mechanism at the `ratis-rpc` layer 
to be triggered. However, `ratis-netty` lacks this corresponding logic.
   
   This leads to an issue: in the `ratis-netty` unit test 
(`TestRaftAsyncWithNetty#testStaleReadAsync`), the test fails immediately if 
`s0` is not the leader. In contrast, `ratis-grpc` handles the 
`NotLeaderException` properly, which triggers the retry mechanism and allows 
the test to pass.
   
   Exception handling logic in `ratis-grpc`
   
   
[ratis/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java](https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L307-L329)
   
   I implemented similar logic in the `NettyClientRpc.java` code.
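
The idea can be illustrated with a minimal, JDK-only sketch. It is not the code in this PR or in `GrpcClientProtocolClient`: the class `RetryOnFailureSketch`, the exception `NotLeaderSketchException`, and the methods `sendAsync` and `sendWithRetry` are hypothetical names chosen for illustration. The point is only that a failure delivered through the returned future (rather than thrown from the send call) is visible to a retry layer.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryOnFailureSketch {
  /** Hypothetical stand-in for a "not leader" failure; this is not the Ratis class. */
  static final class NotLeaderSketchException extends RuntimeException {
    NotLeaderSketchException(String message) {
      super(message);
    }
  }

  /** Runs the send and converts a synchronous throw into a failed future. */
  static <T> CompletableFuture<T> sendAsync(Supplier<T> send) {
    final CompletableFuture<T> future = new CompletableFuture<>();
    try {
      future.complete(send.get());
    } catch (Throwable t) {
      // Report the failure through the future so that callers can react to it.
      future.completeExceptionally(t);
    }
    return future;
  }

  /** Retries the send while it fails, making at most maxAttempts attempts. */
  static <T> CompletableFuture<T> sendWithRetry(Supplier<T> send, int maxAttempts) {
    final CompletableFuture<CompletableFuture<T>> nested =
        sendAsync(send).handle((reply, error) -> {
          if (error == null) {
            return CompletableFuture.completedFuture(reply);
          }
          if (maxAttempts <= 1) {
            // Attempts exhausted: propagate the last failure.
            final CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(error);
            return failed;
          }
          return sendWithRetry(send, maxAttempts - 1);
        });
    return nested.thenCompose(f -> f);
  }

  public static void main(String[] args) {
    final AtomicInteger attempts = new AtomicInteger();
    // Simulates a server that rejects the first two requests.
    final Supplier<String> flaky = () -> {
      if (attempts.incrementAndGet() < 3) {
        throw new NotLeaderSketchException("s0 is not the leader");
      }
      return "reply";
    };
    System.out.println(sendWithRetry(flaky, 5).join() + " after " + attempts.get() + " attempts");
  }
}
```

A real client would also apply a retry policy (sleep time, leader hint from the exception) instead of retrying immediately; that part is omitted here.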
   
   Added timeout handling logic in `ratis-netty`
   
   The unit `test(misc)` for 
[RATIS-2251](https://issues.apache.org/jira/browse/RATIS-2251) frequently times 
out. The original timeout was set to `60` minutes, and even after increasing it 
to `90` and `120` minutes, the test still timed out. This indicates that the 
issue likely lies within the test itself, rather than the configured timeout 
duration.
   
   The root cause is that `ratis-netty` lacks a timeout mechanism, which causes 
the unit test to hang for extended periods when exceptions occur or there is no 
response.
   
   `ratis-grpc` includes corresponding exception and timeout handling logic, as 
shown below:
   
   
[ratis/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java](https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L367-L371)
   
   I implemented similar logic in the `NettyClientRpc.java` code.
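
As a rough illustration of what a per-request timeout looks like, here is a JDK-only sketch. It is not the code in this PR or in `ratis-grpc`: `RequestTimeoutSketch` and `withTimeout` are hypothetical names, and the use of a `ScheduledExecutorService` is an assumption made for the example. A pending reply future is failed with a `TimeoutException` if nothing completes it in time, so the caller stops waiting instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class RequestTimeoutSketch {
  /** Fails the pending reply with a TimeoutException unless it completes first. */
  static <T> CompletableFuture<T> withTimeout(
      CompletableFuture<T> reply, long timeout, TimeUnit unit, ScheduledExecutorService scheduler) {
    final ScheduledFuture<?> task = scheduler.schedule(() -> {
      // No effect if the reply has already completed.
      reply.completeExceptionally(
          new TimeoutException("no reply within " + timeout + " " + unit));
    }, timeout, unit);
    // Cancel the timer once the reply arrives (or fails) so it does not linger.
    reply.whenComplete((r, e) -> task.cancel(false));
    return reply;
  }

  public static void main(String[] args) {
    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      // Simulates a request whose reply never arrives.
      final CompletableFuture<String> neverAnswered = new CompletableFuture<>();
      withTimeout(neverAnswered, 100, TimeUnit.MILLISECONDS, scheduler).join();
    } catch (CompletionException e) {
      System.out.println("request failed: " + e.getCause());
    } finally {
      scheduler.shutdownNow();
    }
  }
}
```

Because the timeout surfaces as an exceptional completion of the same future, it can feed the same retry path as any other failure.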


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
