Junhong Xu created HBASE-24548:
----------------------------------

             Summary: improvement for HBase SCP
                 Key: HBASE-24548
                 URL: https://issues.apache.org/jira/browse/HBASE-24548
             Project: HBase
          Issue Type: Improvement
            Reporter: Junhong Xu
            Assignee: Junhong Xu


In our internal HBase build, which is based on the community branch-2.1, we
found that after a regionserver is stopped, it takes about 30 s before the
master finally detects that it is dead, when its ephemeral node is deleted in
ZooKeeper. During this time, the regions on this server are unavailable and
client requests to them make no progress. The client log is as follows:
{code:java}
[2020-06-12 15:51:41.888 ActorThreadPool-consumer-processor-talos-set-alias-55-1 ERROR c.x.xmpush.hbase.utils.HBaseHelper] [get data hbase failed, tableName = mipush:app_alias_new]
com.xiaomi.infra.hbase.client.HException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
Fri Jun 12 15:50:44 CST 2020, org.apache.hadoop.hbase.client.RpcRetryingCaller@2dc1865, org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server c3-hadoop-srv-st639.bj,13700,1591932264018 stopping
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1551)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2565)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:134)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

Fri Jun 12 15:50:44 CST 2020, org.apache.hadoop.hbase.client.RpcRetryingCaller@2dc1865, org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server c3-hadoop-srv-st639.bj,13700,1591932264018 stopping
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1551)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2565)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:134)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
The corresponding log on the master:
{code:java}
2020-06-12,15:51:12,003 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c3-hadoop-srv-st639.bj,13700,1591932264018]
2020-06-12,15:51:12,003 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.ServerManager: Processing expiration of c3-hadoop-srv-st639.bj,13700,1591932264018 on c3-hadoop-miui-zk05.bj,13600,1591927126881
2020-06-12,15:51:12,109 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.assignment.AssignmentManager: Added c3-hadoop-srv-st639.bj,13700,1591932264018 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=97428
2020-06-12,15:51:12,109 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$ServerEventsListenerThread-c3-hadoop-miui-zk05.bj,13600,1591927126881] org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$ServerEventsListenerThread: Updating default servers.
2020-06-12,15:51:12,111 INFO [PEWorker-11] org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=97428, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st639.bj,13700,1591932264018, splitWal=true, meta=false
{code}
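For reference, the gap between the two log excerpts above can be checked
mechanically. The following is a minimal sketch, not part of HBase: the class
name DetectionGap and its gap method are made up for illustration, and it
assumes both logs use the same time zone, so the "CST" abbreviation is matched
as literal text. It compares the first client-side retry that saw
RegionServerStoppedException with the master's "ephemeral node deleted" line.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DetectionGap {
    // Client retry timestamps look like "Fri Jun 12 15:50:44 CST 2020".
    // Assumption: both logs share one time zone, so "CST" is treated as a literal.
    private static final DateTimeFormatter CLIENT_FMT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss 'CST' uuuu", Locale.ENGLISH);

    // Master log timestamps look like "2020-06-12,15:51:12,003".
    private static final DateTimeFormatter MASTER_FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd,HH:mm:ss,SSS", Locale.ENGLISH);

    /** Time from the client-side timestamp to the master-side timestamp. */
    static Duration gap(String clientTs, String masterTs) {
        LocalDateTime client = LocalDateTime.parse(clientTs, CLIENT_FMT);
        LocalDateTime master = LocalDateTime.parse(masterTs, MASTER_FMT);
        return Duration.between(client, master);
    }

    public static void main(String[] args) {
        Duration d = gap("Fri Jun 12 15:50:44 CST 2020", "2020-06-12,15:51:12,003");
        System.out.println(d.toMillis() + " ms"); // prints "28003 ms"
    }
}
```

The client timestamps only have one-second resolution, and the server may have
started stopping before the first logged retry, so roughly 28 s is a lower
bound that is consistent with the "about 30 s" observed.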
After discussing this offline with [~zghao], we think this process could be
accelerated by having the regionserver either notify the master directly or
delete its own ephemeral node before it stops.
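
To illustrate the second option, here is a toy model of the ordering only. It
is a sketch under stated assumptions, not HBase or ZooKeeper code: the
EphemeralNode class is a hypothetical in-memory stand-in for the znode the
master watches, time is a logical counter, and shutdownMs is an assumed delay
between the server starting to reject requests and the node disappearing on
its own. The issue does not say what the server does during that window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

public class EarlyDeregistration {
    /** Hypothetical in-memory stand-in for the watched znode; not the ZooKeeper API. */
    static final class EphemeralNode {
        private boolean present = true;
        private final List<LongConsumer> watchers = new ArrayList<>();

        void watchDeletion(LongConsumer onDeleted) {
            watchers.add(onDeleted);
        }

        /** Removes the node at most once and passes the logical deletion time to each watcher. */
        void delete(long nowMs) {
            if (!present) {
                return;
            }
            present = false;
            for (LongConsumer w : watchers) {
                w.accept(nowMs);
            }
        }
    }

    /**
     * Returns how long after the stop request the watcher (the "master") is notified.
     * shutdownMs is an assumed value, not something measured from HBase.
     */
    static long detectionDelayMs(long shutdownMs, boolean deleteNodeFirst) {
        EphemeralNode node = new EphemeralNode();
        long[] detectedAt = {-1};
        node.watchDeletion(t -> detectedAt[0] = t);

        long stopRequestedAt = 0;
        if (deleteNodeFirst) {
            // Proposed ordering: deregister before doing the slow shutdown work.
            node.delete(stopRequestedAt);
        }
        // Assumed baseline: the node only goes away once shutdown has finished.
        node.delete(stopRequestedAt + shutdownMs);
        return detectedAt[0] - stopRequestedAt;
    }

    public static void main(String[] args) {
        System.out.println(detectionDelayMs(28_000, false)); // prints 28000
        System.out.println(detectionDelayMs(28_000, true));  // prints 0
    }
}
```

In this model the master's reaction time depends only on when the node is
deleted, which is why moving the deletion ahead of the rest of the shutdown
removes the window in which regions are unavailable but not yet reassigned.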



