Junhong Xu created HBASE-24548:
----------------------------------
Summary: improvement for HBase SCP
Key: HBASE-24548
URL: https://issues.apache.org/jira/browse/HBASE-24548
Project: HBase
Issue Type: Improvement
Reporter: Junhong Xu
Assignee: Junhong Xu
In our internal HBase build, which is based on the community branch-2.1, we
found that the master only detects a stopped regionserver about 30 s after it
stops, when its ephemeral node is finally deleted in ZooKeeper. During this
window, the regions on that server are unavailable and make no progress. The
client log is as follows:
{code:java}
[2020-06-12 15:51:41.888 ActorThreadPool-consumer-processor-talos-set-alias-55-1 ERROR c.x.xmpush.hbase.utils.HBaseHelper] [get data hbase failed, tableName = mipush:app_alias_new]
com.xiaomi.infra.hbase.client.HException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
Fri Jun 12 15:50:44 CST 2020, org.apache.hadoop.hbase.client.RpcRetryingCaller@2dc1865, org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server c3-hadoop-srv-st639.bj,13700,1591932264018 stopping
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1551)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2565)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:134)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Fri Jun 12 15:50:44 CST 2020, org.apache.hadoop.hbase.client.RpcRetryingCaller@2dc1865, org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server c3-hadoop-srv-st639.bj,13700,1591932264018 stopping
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1551)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2565)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:134)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
The corresponding master log:
{code:java}
2020-06-12,15:51:12,003 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [c3-hadoop-srv-st639.bj,13700,1591932264018]
2020-06-12,15:51:12,003 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.ServerManager: Processing expiration of c3-hadoop-srv-st639.bj,13700,1591932264018 on c3-hadoop-miui-zk05.bj,13600,1591927126881
2020-06-12,15:51:12,109 INFO [RegionServerTracker-0] org.apache.hadoop.hbase.master.assignment.AssignmentManager: Added c3-hadoop-srv-st639.bj,13700,1591932264018 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=97428
2020-06-12,15:51:12,109 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$ServerEventsListenerThread-c3-hadoop-miui-zk05.bj,13600,1591927126881] org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$ServerEventsListenerThread: Updating default servers.
2020-06-12,15:51:12,111 INFO [PEWorker-11] org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=97428, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st639.bj,13700,1591932264018, splitWal=true, meta=false
{code}
After an offline discussion with [~zghao], we think this process could be
accelerated by having the regionserver either send a message to the master or
delete its own ephemeral node before it stops.
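
To illustrate why deleting the ephemeral node before stopping would shorten the window, here is a toy simulation. It is only a sketch under stated assumptions: `ToyCoordinator` is a hypothetical in-memory stand-in for ZooKeeper, the node path is made up, and the 30 s timeout simply mirrors the delay observed above. It does not use the real ZooKeeper or HBase APIs and is not the proposed patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class EarlyExpirationSketch {

    /** Hypothetical in-memory stand-in for ZooKeeper; not a real API. */
    static class ToyCoordinator {
        private final Map<String, Long> expiryMsByNode = new HashMap<>();
        private final List<Consumer<String>> deletionListeners = new ArrayList<>();

        /** The "master" registers here to learn about deleted nodes. */
        void onNodeDeleted(Consumer<String> listener) {
            deletionListeners.add(listener);
        }

        /** Creates a node that disappears once the simulated clock reaches expiryMs. */
        void createEphemeral(String path, long expiryMs) {
            expiryMsByNode.put(path, expiryMs);
        }

        /** Explicit delete: listeners are notified immediately. */
        void delete(String path) {
            if (expiryMsByNode.remove(path) != null) {
                for (Consumer<String> listener : deletionListeners) {
                    listener.accept(path);
                }
            }
        }

        /** Expires every node whose session timeout has elapsed at nowMs. */
        void advanceTo(long nowMs) {
            List<String> expired = new ArrayList<>();
            for (Map.Entry<String, Long> entry : expiryMsByNode.entrySet()) {
                if (entry.getValue() <= nowMs) {
                    expired.add(entry.getKey());
                }
            }
            for (String path : expired) {
                delete(path);
            }
        }
    }

    /**
     * Returns how long after the server stops (at t = 0) the master sees
     * the node deletion, with the clock advancing in 1 s ticks.
     */
    static long detectionDelayMs(boolean deleteBeforeStop, long sessionTimeoutMs) {
        ToyCoordinator zk = new ToyCoordinator();
        long[] clock = {0L};
        long[] detectedAt = {-1L};
        zk.onNodeDeleted(path -> {
            if (detectedAt[0] < 0) {
                detectedAt[0] = clock[0];
            }
        });

        String node = "/hbase/rs/example-server"; // hypothetical path
        // Assumption: the last heartbeat and the stop both happen at t = 0.
        zk.createEphemeral(node, sessionTimeoutMs);

        if (deleteBeforeStop) {
            zk.delete(node); // the improvement suggested in this issue
        }
        for (long t = 0; detectedAt[0] < 0; t += 1000) {
            clock[0] = t;
            zk.advanceTo(t);
        }
        return detectedAt[0];
    }

    public static void main(String[] args) {
        System.out.println("delete before stop: " + detectionDelayMs(true, 30_000L) + " ms");
        System.out.println("wait for expiry:    " + detectionDelayMs(false, 30_000L) + " ms");
    }
}
```

In this model the master is notified at 0 ms when the node is deleted explicitly, and only at 30000 ms when it has to wait for the session to expire. The other option mentioned above, sending a message to the master directly, would have a similar effect without relying on the ZooKeeper watch.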