[ 
https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790411#comment-16790411
 ] 

lujie commented on HBASE-22041:
-------------------------------

Attaching the logs of the buggy execution and of a normal execution for comparison. The 
RegionServers are hadoop12, hadoop13, and hadoop14.
h4. 1 In the normal execution, we crash hadoop13, which holds hbase:meta, and 
everything is OK. The log looks like:

 
{code:java}
1 master.ServerManager: Registering regionserver=hadoop12,16020,1552412058473
2 master.ServerManager: Registering regionserver=hadoop13,16020,1552412046289
3 master.ServerManager: Registering regionserver=hadoop14,16020,1552412063546
4 zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as hadoop13,16020,1552412046289
5 master.RegionServerTracker: RegionServer ephemeral node created, adding 
[hadoop13,16020,1552412046289]
6 master.RegionServerTracker: RegionServer ephemeral node created, adding 
[hadoop12,16020,1552412058473]
7 master.RegionServerTracker: RegionServer ephemeral node created, adding 
[hadoop14,16020,1552412063546]

8 master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [hadoop13,16020,1552412046289]
9 master.ServerManager: Processing expiration of hadoop13,16020,1552412046289 
on hadoop11,16000,1552412053502
{code}
log#1,2,3 show that hadoop12, hadoop13, and hadoop14 are added to "onlineServers" (in 
ServerManager).

log#8,9 show that the master detects the hadoop13 crash and will remove it from the 
"onlineServers" field of ServerManager.

 
h4. 2 In the buggy execution, we crash hadoop14 while 
RegionServerTracker#refresh is slowed down (we inject a sleep to simulate this). The log 
becomes:

 
{code:java}
 1 master.ServerManager: Registering regionserver=hadoop14,16020,1552410583724
 2 master.ServerManager: Registering regionserver=hadoop12,16020,1552410578454
 3 master.ServerManager: Registering regionserver=hadoop13,16020,1552410566504
 4 zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as hadoop14,16020,1552410583724

 5 master.RegionServerTracker: RegionServer ephemeral node created, adding 
[hadoop12,16020,1552410578454]
 6 master.RegionServerTracker: RegionServer ephemeral node created, adding 
[hadoop13,16020,1552410566504]
 7 procedure.RSProcedureDispatcher: request to server 
hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
hadoop14/172.16.1.131:16020 failed on connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 syscall:getsockopt(..) failed: Connection refused: 
hadoop14/172.16.1.131:16020, try=0, retrying...

{code}
log#1,2,3 show that hadoop12, hadoop13, and hadoop14 are added to "onlineServers" (in 
ServerManager).

But log#5,6 show that RegionServerTracker only adds the ephemeral nodes of hadoop12 and 
hadoop13, not that of hadoop14.

So the master can't detect the hadoop14 crash, and hadoop14 will stay in 
onlineServers forever.
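
The sequence above can be illustrated with a toy model. This is only a sketch under assumptions: the class, its fields, and the refresh logic are hypothetical stand-ins and not the actual ServerManager or RegionServerTracker code. It assumes that a refresh expires only servers the tracker had already seen, so a server that crashes before the first refresh is never expired.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: hypothetical stand-ins, not the real HBase classes.
public class MissedExpirationSketch {
    // Stand-in for the "onlineServers" field of ServerManager.
    private final Set<String> onlineServers = new HashSet<>();
    // Stand-in for the servers the tracker has already seen in ZooKeeper.
    private final Set<String> trackedServers = new HashSet<>();

    void register(String server) {
        onlineServers.add(server);
    }

    // Assumed refresh logic: expire servers that were tracked before but are
    // missing from the current ZooKeeper children, then track the new children.
    void refresh(Set<String> zkChildren) {
        for (String server : new HashSet<>(trackedServers)) {
            if (!zkChildren.contains(server)) {
                trackedServers.remove(server);
                onlineServers.remove(server); // expiration
            }
        }
        trackedServers.addAll(zkChildren);
    }

    // Returns true if hadoop14 is still in onlineServers after it crashed.
    static boolean staleAfterCrash(boolean refreshBeforeCrash) {
        MissedExpirationSketch master = new MissedExpirationSketch();
        Set<String> zk = new HashSet<>(Arrays.asList("hadoop12", "hadoop13", "hadoop14"));
        for (String server : zk) {
            master.register(server);
        }
        if (refreshBeforeCrash) {
            master.refresh(zk); // tracker sees hadoop14 in time
        }
        zk.remove("hadoop14"); // crash: the ephemeral node disappears
        master.refresh(zk);    // the (possibly delayed) refresh
        return master.onlineServers.contains("hadoop14");
    }

    public static void main(String[] args) {
        System.out.println("refresh before crash, stale entry: " + staleAfterCrash(true));
        System.out.println("refresh after crash, stale entry: " + staleAfterCrash(false));
    }
}
```

In this model the delayed refresh never learns about hadoop14, so nothing triggers its expiration, which matches what log#5,6 suggest.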

In 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.ExecuteProceduresRemoteCall.run

 
{code:java}
1 try {
2   sendRequest(getServerName(), request.build());
3 } catch (IOException e) {
4   e = unwrapException(e);
5   // TODO: In the future some operation may want to bail out early.
6   // TODO: How many times should we retry (use numberOfAttemptsSoFar)
7   if (!scheduleForRetry(e)) {
8     remoteCallFailed(procedureEnv, e);
9   }
10 }
{code}
The master calls sendRequest to hadoop14 and fails, so it calls scheduleForRetry 
to retry. In scheduleForRetry, the master checks whether hadoop14 is in 
onlineServers; if it is, the master retries. Hence the master retries forever and prints 
thousands of logs like log#7.

I think we can fix this bug by setting a threshold for numberOfAttemptsSoFar.

The TODO comments at line#5,6 also show that a fix is needed and hint at how to do it.
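
A minimal sketch of what such a threshold could look like, written as a standalone program rather than a patch to RSProcedureDispatcher. The names RemoteCall, shouldRetry, run, and the maxAttempts parameter are hypothetical, and the value 5 is only for illustration; only numberOfAttemptsSoFar comes from the TODO comment above.

```java
import java.io.IOException;
import java.net.ConnectException;

// Standalone sketch only: not the real RSProcedureDispatcher code.
public class BoundedRetrySketch {
    // Hypothetical stand-in for the remote request being sent.
    interface RemoteCall {
        void send() throws IOException;
    }

    // Retry only while the server is believed online AND the attempt budget is not used up.
    static boolean shouldRetry(boolean serverOnline, int numberOfAttemptsSoFar, int maxAttempts) {
        return serverOnline && numberOfAttemptsSoFar < maxAttempts;
    }

    // Returns the number of attempts made on success; rethrows once retries are exhausted.
    static int run(RemoteCall call, boolean serverOnline, int maxAttempts) throws IOException {
        int numberOfAttemptsSoFar = 0;
        while (true) {
            try {
                call.send();
                return numberOfAttemptsSoFar + 1;
            } catch (IOException e) {
                numberOfAttemptsSoFar++;
                if (!shouldRetry(serverOnline, numberOfAttemptsSoFar, maxAttempts)) {
                    // Analogous to bailing out to remoteCallFailed instead of retrying forever.
                    throw new IOException("gave up after " + numberOfAttemptsSoFar + " attempts", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Simulates a crashed server that is still listed as online.
        RemoteCall alwaysRefused = () -> {
            throw new ConnectException("Connection refused");
        };
        try {
            run(alwaysRefused, true, 5);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints "gave up after 5 attempts"
        }
    }
}
```

With a bound like this, a stale onlineServers entry would lead to a reported failure after a fixed number of attempts instead of an endless retry loop. What the right threshold is, and what the master should do once it is reached, is left open here.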


> Master stuck in startup and print "FailedServerException" forever
> -----------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, normal.zip
>
>
> While the master is booting fresh, we crash (kill -9) the RS that holds meta. We find 
> that the master startup fails and prints thousands of logs like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to 
> hadoop14/172.16.1.131:16020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] 
> procedure.RSProcedureDispatcher: request to server 
> hadoop14,16020,1552410583724 failed due to 
> org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
> hadoop14/172.16.1.131:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
