[ 
https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753130#comment-15753130
 ] 

Allan Yang edited comment on HBASE-17275 at 12/16/16 1:56 AM:
--------------------------------------------------------------

[~ted_yu] can you or find someone to look at HBASE-17264,  HBASE-17265 and 
HBASE-17275, they are all related, can the fixes really take effects in our 
environment. 


was (Author: allan163):
[~tedyu] can you or find someone to look at HBASE-17264,  HBASE-17265 and 
HBASE-17275, they are all related, can the fixes really take effects in our 
environment. 

> Assign timeout cause region unassign forever
> --------------------------------------------
>
>                 Key: HBASE-17275
>                 URL: https://issues.apache.org/jira/browse/HBASE-17275
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.2.3, 1.1.7
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-17275-branch-1.patch
>
>
> This is a real cased happened in my test cluster.
> I have more 8000 regions to assign when I restart a cluster, but I only 
> started one regionserver. That means master need to assign these 8000 regions 
> to a single server(I know it is not right, but just for testing).
> The rs recevied the open region rpc and began to open regions. But the due to 
> the hugh number of regions, , master timeout the rpc call(but actually some 
> region had already opened) after 1 mins, as you can see from log 1.
> {noformat}
> 1. 2016-11-22 10:17:32,285 INFO  [example.org:30001.activeMasterManager] 
> master.AssignmentManager: Unable to communicate with 
> example.org,30003,1479780976834 in order to assign regions,
> java.io.IOException: Call to /example.org:30003 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, 
> operationTimeout=60000 expired.
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796)
>         at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188)
>         at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711)
>         at java.lang.Thread.run(Thread.java:756)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, 
> waitTime=60001, operationTimeout=60000 expired.
>         at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
>         ... 14 more  
> {noformat}
> for the region 7e9aee32eb98a6fc9d503b99fc5f9615(like many others), after 
> timeout, master use a pool to re-assign them, as in 2
> {noformat}
> 2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834} 
>  
> {noformat}
> But, this region was actually opened on the rs, but (maybe) due to the hugh 
> pressure, the OPENED zk event recevied by master , as you can tell from 3, 
> "which is more than 15 seconds late"
> {noformat}
> 3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479780992078, server=example.org,30003,1479780976834}
> {noformat}
> In the meantime, master still try to re-assign this region in another thread. 
> Master first close this region in case of multi assign, then change the state 
> of this region change from PENDING_OPEN >OFFLINE>PENDING_OPEN. Its RIT node 
> in zk was also transitioned to OFFLINE, as in 4,5,6,7
> {noformat}
> 4. 2016-11-22 10:17:32,321 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
> Sent CLOSE to example.org,30003,1479780976834 for region 
> test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615.
> 5. 2016-11-22 10:17:32,461 INFO  [AM.-pool1-t26] master.RegionStates: 
> Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479781052344, server=example.org,30003,1479780976834} to 
> {7e9aee32eb98a6fc9d503b99fc5f9615 state=OFFLINE, ts=1479781052461, 
> server=example.org,30003,1479780976834}
> 6. 2016-11-22 10:17:32,469 DEBUG [AM.-pool1-t26] zookeeper.ZKAssign: 
> master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Creating (or updating) unassigned node 
> 7e9aee32eb98a6fc9d503b99fc5f9615 with OFFLINE state
> 7. 2016-11-22 10:17:32,546 INFO  [AM.-pool1-t26] master.AssignmentManager: 
> Assigning test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615. to 
> example.org,30003,1479780976834
> {noformat}
> The thread handling the OPENED zk event still operating on this region too. 
> IT change the state the region state from PENDING_OPEN to OPENED, but when 
> trying to delete the RIT zk node, it found out this zk node was in a illeagl 
> state OFFLINE, which was set in 6. So the RIT node can't be deleted. Now, the 
> inconsistent case appeared. For this region,  its state shows OPENED, but in 
> RIT node, its state shows OFFLINE 
> {noformat}
> 8. 2016-11-22 10:17:32,563 INFO  [AM.ZK.Worker-pool2-t3] master.RegionStates: 
> Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
> ts=1479781052546, server=example.org,30003,1479780976834} to 
> {7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, 
> server=example.org,30003,1479780976834}
> 9. 2016-11-22 10:17:32,595 WARN  [AM.ZK.Worker-pool2-t3] zookeeper.ZKAssign: 
> master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Attempting to delete unassigned node 
> 7e9aee32eb98a6fc9d503b99fc5f9615 in RS_ZK_REGION_OPENED state but node is in 
> M_ZK_REGION_OFFLINE state
> 10. 2016-11-22 10:17:32,595 WARN  [AM.ZK.Worker-pool2-t12] 
> zookeeper.ZKAssign: master:30001-0x15810b5f633015e, 
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
>  baseZNode=/test-hbase11-func2 Attempting to delete unassigned node 
> 4fb51c279d694f8808739ea632673584 in RS_ZK_REGION_OPENED state but node is in 
> M_ZK_REGION_OFFLINE state
> 11. 2016-11-22 10:17:32,595 ERROR [AM.ZK.Worker-pool2-t3] 
> coordination.ZkOpenRegionCoordination: Znode of region 
> 7e9aee32eb98a6fc9d503b99fc5f9615 could not be deleted.
> {noformat}
> After a while, the second assign's zk event arrived, due to the inconsistent 
> state(expect PENDING_OPEN but OPENED), master unassign this region, then this 
> region will remain unassigned
> {noformat}
> 12. 2016-11-22 10:17:51,305 DEBUG [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=example.org,30003,1479780976834, 
> region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
> current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, 
> server=example.org,30003,1479780976834}
> 13. 2016-11-22 10:17:51,305 WARN  [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Received OPENED for 
> 7e9aee32eb98a6fc9d503b99fc5f9615 from example.org,30003,1479780976834 but the 
> region isn't PENDING_OPEN/OPENING here: {7e9aee32eb98a6fc9d503b99fc5f9615 
> state=OPEN, ts=1479781052563, server=example.org,30003,1479780976834}
> 14. 2016-11-22 10:17:51,308 DEBUG [AM.ZK.Worker-pool2-t20] 
> master.AssignmentManager: Sent CLOSE to example.org,30003,1479780976834 for 
> region test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615.
> {noformat}
> Restart master may solve this issue, but due to HBASE-17264 and HBASE-17265, 
> restart master may still not helping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to