[ 
https://issues.apache.org/jira/browse/HBASE-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21851:
-------------------------------------
    Summary: meta failure (or slow meta) can cause master to sortof deadlock   
(was: slow meta can cause master to sortof deadlock and bring down the cluster)

> meta failure (or slow meta) can cause master to sortof deadlock 
> ----------------------------------------------------------------
>
>                 Key: HBASE-21851
>                 URL: https://issues.apache.org/jira/browse/HBASE-21851
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Because many threads synchronously retry meta updates for a very long time, 
> the master doesn't appear to have enough threads left to process requests.
> The meta server died but its SCP is not processed; I'm not sure whether that 
> is because the threads are all busy or for some other reason (the ZK issue 
> we've seen earlier in our cluster?).
> {noformat}
> 2019-02-05 13:20:39,225 INFO  [KeepAlivePEWorker-32] 
> assignment.RegionStateStore: pid=805758 updating hbase:meta 
> row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING, 
> regionLocation=server,17020,1549400274239
> ...
> 2019-02-05 13:39:42,521 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor: 
> Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
> {noformat}
> It starts dropping timed out calls:
> {noformat}
> 2019-02-05 13:39:45,877 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer: 
> Dropping timed out call: callId: 7 service: RegionServerStatusService 
> methodName: RegionServerReport size: 102 connection: ...:35743 deadline: 
> 1549401663387 ...
> RS:
> 2019-02-05 13:39:45,521 INFO  [RS_OPEN_REGION-regionserver/..:17020-4] 
> regionserver.HRegionServer: Failed report transition server ...
> org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ..., 
> too many items queued ?
> {noformat}
> I think this eventually causes the RSes to kill themselves, which further 
> increases load on the master.
> I wonder if the meta retry should be async? That way other calls could be 
> processed.
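To illustrate the async-retry idea raised in the description, here is a minimal sketch using only java.util.concurrent. It is not HBase code: the class AsyncRetrySketch, the method retryAsync, and the simulated "meta update" are hypothetical names chosen for illustration. The assumption is that a failed attempt is re-scheduled after a delay rather than sleeping on the worker thread, so the worker can pick up other tasks between attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only: none of these names are HBase APIs.
public class AsyncRetrySketch {

    /**
     * Runs `attempt` on the scheduler. If it returns false, the next attempt
     * is scheduled after `delayMillis` instead of blocking the worker thread.
     * The returned future completes with the number of the successful attempt,
     * or exceptionally once `maxAttempts` attempts have failed.
     */
    public static CompletableFuture<Integer> retryAsync(
            ScheduledExecutorService scheduler,
            Supplier<Boolean> attempt,
            int maxAttempts,
            long delayMillis) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        schedule(scheduler, attempt, 1, maxAttempts, delayMillis, result);
        return result;
    }

    private static void schedule(
            ScheduledExecutorService scheduler,
            Supplier<Boolean> attempt,
            int attemptNo,
            int maxAttempts,
            long delayMillis,
            CompletableFuture<Integer> result) {
        // The first attempt runs immediately; later ones wait for the delay.
        long delay = attemptNo == 1 ? 0 : delayMillis;
        scheduler.schedule(() -> {
            try {
                if (attempt.get()) {
                    result.complete(attemptNo);
                } else if (attemptNo >= maxAttempts) {
                    result.completeExceptionally(new IllegalStateException(
                            "gave up after " + attemptNo + " attempts"));
                } else {
                    // Re-schedule and return, releasing the worker thread.
                    schedule(scheduler, attempt, attemptNo + 1,
                            maxAttempts, delayMillis, result);
                }
            } catch (RuntimeException e) {
                result.completeExceptionally(e);
            }
        }, delay, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // A single worker thread stands in for a scarce handler pool.
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        AtomicInteger calls = new AtomicInteger();

        // Simulated meta update that fails twice, then succeeds.
        CompletableFuture<Integer> update =
                retryAsync(scheduler, () -> calls.incrementAndGet() >= 3, 5, 50);

        // An unrelated task submitted while the update is still retrying.
        CompletableFuture<String> other =
                CompletableFuture.supplyAsync(() -> "report handled", scheduler);

        System.out.println(other.join());
        System.out.println("meta update succeeded on attempt " + update.join());
        scheduler.shutdown();
    }
}
```

With a blocking retry loop, the single worker would stay occupied until the update succeeded or gave up, and the unrelated task would wait behind it. Whether this pattern fits the master's procedure workers is a separate question that the sketch does not address.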



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
