[ 
https://issues.apache.org/jira/browse/RATIS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568098#comment-17568098
 ] 

Riguz Lee commented on RATIS-1625:
----------------------------------

I think the original design is good, provided the setConf call can block 
until it succeeds. However, if a retry policy is set on the Raft client, the 
setConf call might fail before the new nodes have started. Here's what I found:
{noformat}
org.apache.ratis.protocol.exceptions.RaftRetryFailureException: Failed 
SetConfigurationRequest:client-23453424E9DC->dw-dynamic-service-1@group-ABB3109A44C1,
 cid=1, seq=0, RW, null, 
peers:[dw-dynamic-service-3|rpc:dw-dynamic-service-3:6000|priority:0, 
dw-dynamic-service-2|rpc:dw-dynamic-service-2:6000|priority:0, 
dw-dynamic-service-1|rpc:dw-dynamic-service-1:6000|priority:0, 
dw-dynamic-service-0|rpc:dw-dynamic-service-0:6000|priority:0, 
dw-dynamic-service-4|rpc:dw-dynamic-service-4:6000|priority:0] for 10 attempts 
with RetryLimited(maxAttempts=10, sleepTime=1000ms){noformat}
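
To illustrate the timing problem, here is a minimal, self-contained sketch in 
plain Java. It is not Ratis code: callWithRetry is a hypothetical stand-in for 
a client-side retry loop, and "the new peers become reachable on the 15th 
attempt" is an assumption chosen only to show how a bounded retry budget can 
run out before the new nodes are up.

```java
import java.util.function.IntPredicate;

public class BoundedRetryDemo {
    /**
     * Hypothetical stand-in for a client-side retry loop (not the Ratis API).
     * Calls attempt.test(n) for n = 1..maxAttempts, sleeping between tries.
     * Returns the number of the attempt that succeeded, or -1 if the retry
     * budget was exhausted (or the thread was interrupted while sleeping).
     */
    public static int callWithRetry(IntPredicate attempt, int maxAttempts, long sleepMillis) {
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.test(n)) {
                return n;
            }
            if (n < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumption: the new peers only become reachable on the 15th attempt.
        IntPredicate newPeersReady = n -> n >= 15;

        // A budget of 10 attempts gives up before the peers are ready.
        System.out.println(callWithRetry(newPeersReady, 10, 1)); // prints -1

        // A larger budget lasts until the peers are ready.
        System.out.println(callWithRetry(newPeersReady, 30, 1)); // prints 15
    }
}
```

With maxAttempts=10 and sleepTime=1000ms, as in the log above, the whole 
retry window is only about 10 seconds, so the request can fail whenever the 
new nodes need longer than that to start.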

> client.admin().setConfiguration fails due to ReconfigurationTimeoutException
> ----------------------------------------------------------------------------
>
>                 Key: RATIS-1625
>                 URL: https://issues.apache.org/jira/browse/RATIS-1625
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Riguz Lee
>            Priority: Major
>
> As has been discussed in 
> [https://lists.apache.org/thread/tt1j3jkogh71k2hvq5gtltwmphxfy736]
> , the problem is that:
>  * New nodes will be stopped by the leader because they are not in the old 
> configuration
>  * setConfiguration won't succeed because it cannot communicate with the new 
> nodes, since they have been shut down.
> Steps to reproduce:
>  * Start a cluster with 3 nodes
>  * Start 2 new nodes with the 5-node configuration
>  * Call the API to change the configuration on the old nodes
> Logs when calling the admin API:
> {noformat}
> org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException: 
> 10.19.26.23-6002@group-0242AC120002-CotionStagingState: Fail to set 
> configuration 
> [10.19.26.23-6004|rpc:10.19.26.23:6004|admin:|client:|dataStreamity:0, 
> 10.19.26.23-6003|rpc:10.19.26.23:6003|admin:|client:|dataStream:|priority:0, 
> 10.19.26.23-6002|rpc:10.3:6002|admin:|client:|dataStream:|priority:0, 
> 10.19.26.23-6001|rpc:10.19.26.23:6001|admin:|client:|dataStrearity:0, 
> 10.19.26.23-6005|rpc:10.19.26.23:6005|admin:|client:|dataStream:|priority:0] 
> due to NOPROGRESS
>     at 
> org.apache.ratis.server.impl.LeaderStateImpl$ConfigurationStagingState.fail(LeaderStateImpl.java:[ratis-server-2.3.0.jar!/:2.3.0]
>     at 
> org.apache.ratis.server.impl.LeaderStateImpl.checkStaging(LeaderStateImpl.java:704)
>  ~[ratis-serve.jar!/:2.3.0]
>     at 
> org.apache.ratis.server.impl.LeaderStateImpl.access$500(LeaderStateImpl.java:95)
>  ~[ratis-server-2r!/:2.3.0]
>     at 
> org.apache.ratis.server.impl.LeaderStateImpl$EventProcessor.run(LeaderStateImpl.java:636)
>  ~[ratis-2.3.0.jar!/:2.3.0]{noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
