[ 
https://issues.apache.org/jira/browse/HDDS-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057833#comment-17057833
 ] 

Shashikant Banerjee edited comment on HDDS-3086 at 3/12/20, 12:00 PM:
----------------------------------------------------------------------

2020-02-27 14:50:15,865 [Thread-1361] INFO client.GrpcClientProtocolService 
(GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed 
RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=162*, seq=0, Watch-ALL_COMMITTED(152), Message:<EMPTY>, 
reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=162, FAILED org.apache.ratis.protocol.NotLeaderException: Server 
11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, 
commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 
906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

 2020-02-27 14:50:15,876 [Thread-1371] INFO client.GrpcClientProtocolService 
(GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed 
RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=151*, seq=0, Watch-ALL_COMMITTED(135), Message:<EMPTY>, 
reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=151, FAILED org.apache.ratis.protocol.NotLeaderException: Server 
11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, 
commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 
906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

It looks like the Raft client does not retry on a different server after 
getting a NotLeaderException. This appears to be because the handleIOException 
function is not synchronised, so it can be called from different threads that 
share the same Raft client instance. In the log entries quoted above, the same 
client (client-E254C6160E81) has two requests, cid=162 and cid=151, fail about 
11 ms apart. Such concurrent calls can each change the leaderID field in the 
RaftClientImpl instance.
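
To illustrate the kind of guard that would avoid this, here is a minimal 
sketch, not the Ratis code. LeaderTracker and onNotLeader are hypothetical 
names, and server ids are plain strings for simplicity. It assumes the fix is 
to make the "is this failure report still about the current leader?" check and 
the leader update one atomic step, so a stale report from another thread 
cannot overwrite a newer leader.

```java
import java.util.Objects;

/** Hypothetical stand-in for a client's leader bookkeeping; not Ratis code. */
class LeaderTracker {
    // Guarded by "this": read and written only inside synchronized methods.
    private String leaderId;

    LeaderTracker(String initialLeader) {
        this.leaderId = initialLeader;
    }

    synchronized String getLeaderId() {
        return leaderId;
    }

    /**
     * Handles a not-leader reply for a request that was sent to failedServer.
     * The leader is switched to suggestedLeader only if failedServer is still
     * the server this client believes to be the leader. A report about any
     * other server is stale and is ignored.
     *
     * @return true if this call changed the leader, false if it was stale
     */
    synchronized boolean onNotLeader(String failedServer, String suggestedLeader) {
        if (!Objects.equals(leaderId, failedServer)) {
            return false; // another thread already moved the leader on
        }
        leaderId = suggestedLeader;
        return true;
    }
}
```

With this shape, two threads that both saw a failure from the old server can 
call onNotLeader concurrently: the first call switches the leader and the 
second returns false without touching the field.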

[~swagle], can you have a look at this?
cc [~msingh]



> Failure running integration test it-freon 
> ------------------------------------------
>
>                 Key: HDDS-3086
>                 URL: https://issues.apache.org/jira/browse/HDDS-3086
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: freon
>            Reporter: Supratim Deka
>            Assignee: Siddharth Wagle
>            Priority: Major
>         Attachments: debug_output.zip, 
> org.apache.hadoop.fs.ozone.contract.ITestOzoneContractDistCp-output.txt, 
> org.apache.hadoop.ozone.freon.TestDataValidateWithDummyContainers-output.txt, 
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator-output.txt, 
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator.txt
>
>
> Observed a time-out during pr-check/it-freon for HDDS-2940. Failure appears 
> unrelated to the changes in the patch. 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.193 
> s - in org.apache.hadoop.ozone.freon.TestDataValidateWithUnsafeByteOperations
> [INFO] Running org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
> 30.559 s - in org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [INFO] 
> [INFO] Results:
> [INFO] 
> [WARNING] Tests run: 16, Failures: 0, Errors: 0, Skipped: 3
> [INFO] 
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time:  28:58 min
> [INFO] Finished at: 2020-02-26T17:55:42Z
> [INFO] 
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) 
> on project hadoop-ozone-integration-test: There was a timeout or other error 
> in the fork -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
