[ 
https://issues.apache.org/jira/browse/HBASE-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-19710:
---------------------------
    Attachment: master-006.tar.gz
                rs-009.log.tar.gz
                master-005-log.tar.gz

rs-009 is the log of the region server where the namespace table was last open.
master-006 is the log of the master that first saw the namespace table get stuck.
master-005 is the log of the master that became active next, with the namespace 
table still stuck.

> hbase:namespace table was stuck in transition
> ---------------------------------------------
>
>                 Key: HBASE-19710
>                 URL: https://issues.apache.org/jira/browse/HBASE-19710
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Critical
>         Attachments: master-005-log.tar.gz, master-006.tar.gz, 
> rs-009.log.tar.gz
>
>
> ITBLL with chaos monkey failed due to namespace table getting stuck in 
> transition.
> From hbase-hbase-master-ctr-e137-1514896590304-3629-01-000006.hwx.site.log, 
> we can see that the master closed the namespace table on 000009:
> {code}
> 2018-01-04 17:24:35,067 DEBUG [main-EventThread] zookeeper.ZKWatcher: 
> master:20000-0x160c222710c0028, 
> quorum=ctr-e137-1514896590304-3629-01-000011.hwx.site:2181,ctr-e137-1514896590304-3629-01-000014.hwx.site:2181,ctr-e137-1514896590304-3629-01-000009.hwx.site:2181,ctr-e137-1514896590304-3629-01-000006.hwx.site:2181,ctr-e137-1514896590304-3629-01-000003.hwx.site:2181,ctr-e137-1514896590304-3629-01-000007.hwx.site:2181,ctr-e137-1514896590304-3629-01-000013.hwx.site:2181,ctr-e137-1514896590304-3629-01-000002.hwx.site:2181,ctr-e137-1514896590304-3629-01-000012.hwx.site:2181,ctr-e137-1514896590304-3629-01-000008.hwx.site:2181,ctr-e137-1514896590304-3629-01-000010.hwx.site:2181, 
> baseZNode=/hbase-unsecure Received ZooKeeper Event, 
> type=NodeChildrenChanged, state=SyncConnected, path=/hbase-unsecure/rs
> 2018-01-04 17:24:35,067 INFO  [ProcExecWrkr-5] assignment.RegionStateStore: 
> pid=643 updating hbase:meta 
> row=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.,   
> regionState=CLOSING, 
> regionLocation=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872
> ...
> 2018-01-04 17:24:35,246 INFO  [ProcExecWrkr-12] 
> procedure.MasterProcedureScheduler: pid=647, ppid=642, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, 
> region=a95ed2d7434a43390fbec73abeeb9fd9 hbase:namespace 
> hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:25:17,041 DEBUG 
> [ctr-e137-1514896590304-3629-01-000006:20000.masterManager] 
> procedure2.ProcedureExecutor: Loading pid=641, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., 
> source=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872, 
> destination=
> {code}
> For the move operation, from ctr-e137-1514896590304-3629-01-000009.hwx.site 
> log:
> {code}
> 2018-01-04 17:24:34,855 DEBUG 
> [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] 
> coprocessor.CoprocessorHost: Stop coprocessor 
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint
> 2018-01-04 17:24:34,855 INFO  
> [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] 
> regionserver.HRegion: Closed 
> hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:24:34,856 DEBUG 
> [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] 
> handler.CloseRegionHandler: Closed 
> hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> ...
> 2018-01-04 17:25:47,607 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=18,queue=0,port=16020] ipc.RpcServer: 
> callId: 16 service: ClientService methodName: Get size: 103 
> connection: 172.27.13.80:36738 deadline: 1515086837568
> org.apache.hadoop.hbase.NotServingRegionException: 
> hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9. is not 
> online on ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3312)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3289)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1354)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2360)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:403)
> {code}
> We can see that the region server was not serving the region.
> After that, the masters kept thinking the namespace table was on 000009, 
> leading to prolonged downtime.
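
When reading the excerpts above, it may help to compare the two server names 
quoted for 000009: the regionLocation written to hbase:meta and the server name 
in the NotServingRegionException. The sketch below is only an illustration for 
log triage, not part of HBase. It assumes the usual "host,port,startcode" 
server name layout, and the helper names (ServerName, parse_server_name, 
same_host_different_instance) are hypothetical. The two strings are copied from 
the logs above.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    """Hypothetical holder for an HBase-style 'host,port,startcode' string."""
    host: str
    port: int
    start_code: int


def parse_server_name(text: str) -> ServerName:
    # Split from the right so the host part is taken as-is.
    host, port, start_code = text.strip().rsplit(",", 2)
    return ServerName(host, int(port), int(start_code))


def same_host_different_instance(a: ServerName, b: ServerName) -> bool:
    # Same host and port but a different start code suggests that two
    # different region server processes are being referred to.
    return (a.host, a.port) == (b.host, b.port) and a.start_code != b.start_code


# regionLocation recorded by the master (master-006 log excerpt)
meta_location = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872")
# server name reported in the NotServingRegionException (rs-009 log excerpt)
responding = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163")

stale = same_host_different_instance(meta_location, responding)
print(stale)
```

For the two strings above the host and port match but the start codes differ, 
which would be consistent with the region server on 000009 having been 
restarted between the close and the failed Get. Whether that restart is what 
left the region stuck is not established by these excerpts.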



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
