Umesh Kumar Kumawat created HBASE-29714:
-------------------------------------------

             Summary: DEFAULT_RS_REMOTE_PROC_RETRY_LIMIT is too low causing RS 
abortion in just 3 seconds
                 Key: HBASE-29714
                 URL: https://issues.apache.org/jira/browse/HBASE-29714
             Project: HBase
          Issue Type: Improvement
          Components: proc-v2
            Reporter: Umesh Kumar Kumawat
             Fix For: 3.0.0, 2.6.5


DEFAULT_RS_RPC_RETRY_INTERVAL is 100 ms and 
DEFAULT_RS_REMOTE_PROC_RETRY_LIMIT is 5, so we are aborting the RS in just 3 
seconds:
2025-11-07 11:48:36,298 WARN [RSProcedureDispatcher-pool-30146] 
procedure.RSProcedureDispatcher - request to rs1,xxx,1762426870165 failed due 
to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
address=rs1:xxx failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
closed, try=0, retrying...

 

2025-11-07 11:48:36,399 WARN [RSProcedureDispatcher-pool-30136] 
procedure.RSProcedureDispatcher - request to rs1,xxx,1762426870165 failed due 
to java.io.IOException: Call to address=rs1:xxx failed on local exception: 
java.io.IOException: Can not send request because relogin is in progress., 
try=1, retrying... ,

2025-11-07 11:48:36,799 WARN [RSProcedureDispatcher-pool-30116] 
procedure.RSProcedureDispatcher - request to rs1,xxx,1762426870165 failed due 
to java.io.IOException: Call to address=rs1:xxx failed on local exception: 
java.io.IOException: Can not send request because relogin is in progress., 
try=2, retrying...

2025-11-07 11:48:37,700 WARN [RSProcedureDispatcher-pool-30088] 
procedure.RSProcedureDispatcher - request to rs1,xxx,1762426870165 failed due 
to java.io.IOException: Call to address=rs1:xxx failed on local exception: 
java.io.IOException: Can not send request because relogin is in progress., 
try=3, retrying...

2025-11-07 11:48:39,301 WARN [RSProcedureDispatcher-pool-30124] 
procedure.RSProcedureDispatcher - Number of retries 5 exceeded limit 5 for the 
given error type. Scheduling server crash for rs1,xxx,1762426870165

 

As per the logs, the RS got aborted in just 100 * (1^2 + 2^2 + 3^2 + 4^2) ms = 
3 seconds. Taking the example of RITThreshold as 60 seconds, we should wait at 
least 30 seconds. We get close to that if we set 
DEFAULT_RS_REMOTE_PROC_RETRY_LIMIT to 10 (100 * (1^2 + ... + 9^2) ms = 28.5 
seconds).
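
The arithmetic can be checked with a small sketch. It assumes the backoff 
model implied by the log timestamps (the n-th retry waits interval * n^2 
milliseconds, and the server crash is scheduled once the retry limit is 
reached). The class and method names are hypothetical and are not taken from 
the HBase source.

```java
final class RetryBackoffSketch {

    // Assumed model, inferred from the log timestamps in this issue:
    // the wait before retry n is intervalMillis * n^2, for n = 1 .. retryLimit - 1.
    static long totalWaitMillis(long intervalMillis, int retryLimit) {
        long total = 0;
        for (int attempt = 1; attempt < retryLimit; attempt++) {
            total += intervalMillis * attempt * attempt;
        }
        return total;
    }

    public static void main(String[] args) {
        // Current defaults: 100 ms interval, limit 5
        System.out.println(totalWaitMillis(100, 5));   // 3000 ms = 3 seconds
        // Proposed limit of 10 with the same interval
        System.out.println(totalWaitMillis(100, 10));  // 28500 ms = 28.5 seconds
    }
}
```

Under this model the total wait grows roughly with the cube of the retry 
limit, so doubling the limit from 5 to 10 raises the time before abort from 
3 seconds to 28.5 seconds.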



--
This message was sent by Atlassian Jira
(v8.20.10#820010)