[ 
https://issues.apache.org/jira/browse/HBASE-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957981#comment-13957981
 ] 

Jeffrey Zhong commented on HBASE-10895:
---------------------------------------

I haven't synced trunk for a while. Yes, the trunk branch does have logic for 
handling FailedServerException. 

But I think the change there introduces a double assignment situation. 
FailedServerException is sometimes caused by a transient network error (in my 
case, a temporary security issue), so the region can still be open on the old 
RS while the same region is allowed to be reassigned, because the recent 
change lets the assignment continue.
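
To make the concern concrete, here is a minimal sketch of the ordering I have 
in mind: only treat the region as safe to reassign once the close is confirmed 
or the old RS is confirmed dead, and otherwise stay in FAILED_CLOSE. This is 
not the trunk code or the attached patch. Every class and method name below 
(CloseBeforeReassignSketch, RegionServer, FlakyServer, closeThenDecide) is 
hypothetical, and the nested FailedServerException is only a local stand-in 
for RpcClient$FailedServerException.

```java
// Hypothetical, self-contained sketch; none of these types are HBase classes.
class CloseBeforeReassignSketch {

    /** Local stand-in for RpcClient$FailedServerException (assumption). */
    public static class FailedServerException extends Exception {
        public FailedServerException(String msg) { super(msg); }
    }

    /** Hypothetical view of a region server as seen from the master. */
    public interface RegionServer {
        void closeRegion(String region) throws FailedServerException;
        boolean isConfirmedDead();
    }

    public enum Outcome { SAFE_TO_REASSIGN, FAILED_CLOSE }

    /**
     * Retries the close with a pause between attempts. The region is reported
     * as safe to reassign only when the close succeeded or the server is
     * confirmed dead, so a transient failure never leads to reassignment
     * while the old server may still host the region.
     */
    public static Outcome closeThenDecide(RegionServer rs, String region,
                                          int maxTries, long pauseMs) {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                rs.closeRegion(region);
                return Outcome.SAFE_TO_REASSIGN; // close confirmed
            } catch (FailedServerException e) {
                if (rs.isConfirmedDead()) {
                    return Outcome.SAFE_TO_REASSIGN; // nothing left to close
                }
                try {
                    Thread.sleep(pauseMs); // give the failed-server entry time to expire
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return Outcome.FAILED_CLOSE;
                }
            }
        }
        return Outcome.FAILED_CLOSE; // still unknown: do not reassign
    }

    /** Test double: fails a fixed number of close calls, then succeeds. */
    public static class FlakyServer implements RegionServer {
        private final int failuresBeforeSuccess;
        private final boolean dead;
        public int attempts = 0;

        public FlakyServer(int failuresBeforeSuccess, boolean dead) {
            this.failuresBeforeSuccess = failuresBeforeSuccess;
            this.dead = dead;
        }

        @Override
        public void closeRegion(String region) throws FailedServerException {
            attempts++;
            if (attempts <= failuresBeforeSuccess) {
                throw new FailedServerException("server is in the failed servers list");
            }
        }

        @Override
        public boolean isConfirmedDead() { return dead; }
    }

    public static void main(String[] args) {
        FlakyServer rs = new FlakyServer(2, false);
        System.out.println(closeThenDecide(rs, "fcef8d691632e99948fbf876d24f907e", 5, 10L));
    }
}
```

The pause between attempts is the important part: retrying immediately just 
hits the failed server list again, which is how all the retries get used up 
within the same millisecond in the log below.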
 

> unassign a region fails due to the hosting region server is in 
> FailedServerList
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-10895
>                 URL: https://issues.apache.org/jira/browse/HBASE-10895
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.96.1, 0.98.1, 0.99.0
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: hbase-10895.patch
>
>
> This issue is similar to HBASE-10833, which deals with the sendRegionOpen 
> RPC, while this JIRA issue happens with sendRegionClose.
> Once an RS is in the failed server list due to a network hiccup, the AM 
> quickly exhausts all retries and later fails the whole region assignment. 
> Below is a sample stack trace:
> {noformat}
> 2014-03-31 13:39:10,056 INFO  [AM.-pool1-t8] master.AssignmentManager: Server 
> hor16n09.gq1.ygridcore.net,60020,1396270942046 returned 
> org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is 
> in the failed servers list: hor16n09.gq1.ygridcore.net/68.142.246.220:60020 
> for loadtest_d1,59999994,1396261861562.fcef8d691632e99948fbf876d24f907e., 
> try=20 of 20
> org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is 
> in the failed servers list: hor16n09.gq1.ygridcore.net/68.142.246.220:60020
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:880)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1065)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.tracedWriteRequest(RpcClient.java:1032)
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1474)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.closeRegion(AdminProtos.java:20854)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.closeRegion(ProtobufUtil.java:1656)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:693)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1685)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1786)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1436)
>         at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:45)
> ....
> 2014-03-31 13:39:10,056 WARN  [AM.-pool1-t8] master.RegionStates: Failed to 
> open/close fcef8d691632e99948fbf876d24f907e on 
> hor16n09.gq1.ygridcore.net,60020,1396270942046, set to FAILED_CLOSE
> 2014-03-31 13:39:10,056 INFO  [AM.-pool1-t8] master.RegionStates: 
> Transitioned {fcef8d691632e99948fbf876d24f907e state=PENDING_OPEN, 
> ts=1396273149814, server=hor16n09.gq1.ygridcore.net,60020,1396270942046} to 
> {fcef8d691632e99948fbf876d24f907e state=FAILED_CLOSE, ts=1396273150056, 
> server=hor16n09.gq1.ygridcore.net,60020,1396270942046}
> 2014-03-31 13:39:10,056 INFO  [AM.-pool1-t8] master.AssignmentManager: Skip 
> assigning {ENCODED => fcef8d691632e99948fbf876d24f907e, NAME => 
> 'loadtest_d1,59999994,1396261861562.fcef8d691632e99948fbf876d24f907e.', 
> STARTKEY => '59999994', ENDKEY => '66666660'}, we couldn't close it: 
> {fcef8d691632e99948fbf876d24f907e state=FAILED_CLOSE, ts=1396273150056, 
> server=hor16n09.gq1.ygridcore.net,60020,1396270942046}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
