slfan1989 commented on PR #1227:
URL: https://github.com/apache/ratis/pull/1227#issuecomment-2888970393

   @szetszwo @adoroszlai 
   
   I’ve locally fixed the issues in the `TestRaftAsyncWithNetty` unit test. 
Below is an analysis of the problem and its root causes:
   
   First, the test failures are not related to our upgrade to `JUnit5`. The 
tests also fail under `JUnit4`.
   
   There are two key issues:
   
   `NettyClientRpc` lacks proper exception handling: When an exception occurs, 
it is thrown directly, which prevents the retry mechanism from being triggered.
   
   `NettyClientRpc` does not implement timeout handling: In timeout scenarios, 
there is no effective control or handling logic in place. 
   
   The test results are as follows:
   
   <img width="1429" alt="image" 
src="https://github.com/user-attachments/assets/d90d9097-023e-4bd9-90b0-e12ed054975b";
 />
   
   - Enhanced exception handling logic in `ratis-netty`
   
   After comparing the execution flow of `ratis-grpc`, I found that it handles 
exceptions properly, which allows the retry mechanism at the `ratis-rpc` layer 
to be triggered. However, `ratis-netty` lacks this corresponding logic.
   
   This leads to an issue: in the `ratis-netty` unit test 
(`TestRaftAsyncWithNetty#testStaleReadAsync`), the test fails immediately if 
`s0` is not the leader. In contrast, `ratis-grpc` handles the 
`NotLeaderException` properly, which triggers the retry mechanism, allowing the 
test to pass successfully.
   
   Exception handling logic in `ratis-grpc`
   
   
https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L307-L329
   
   I implemented similar logic in the `NettyClientRpc.java` code.
   
   - Added timeout handling logic  in `ratis-netty`
   
   The unit test(`misc`) for RATIS-2251 frequently times out. The original 
timeout was set to 60 minutes, and even after increasing it to 90 and 120 
minutes, the test still timed out. This indicates that the issue likely lies 
within the test itself, rather than the configured timeout duration.
   
   The root cause is that `ratis-netty` lacks a timeout mechanism, which causes 
the unit test to hang for extended periods when exceptions occur or there is no 
response.
   
   `ratis-grpc` includes corresponding exception and timeout handling logic, as 
shown below:
   
   
https://github.com/apache/ratis/blob/2eda35dd532f503b299db9e3263ecbeca52da023/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientProtocolClient.java#L367-L371
   
   I implemented similar logic in the `NettyClientRpc.java` code.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to