[
https://issues.apache.org/jira/browse/HBASE-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HBASE-21851:
-------------------------------------
Description:
Due to many threads sync-retrying to update meta for a really long time, the
master doesn't appear to have enough threads left to process requests.
The meta server died, but its SCP is not being processed; I'm not sure whether
that's because the threads are full or for some other reason (the ZK issue
we've seen earlier in our cluster?).
{noformat}
2019-02-05 13:20:39,225 INFO [KeepAlivePEWorker-32]
assignment.RegionStateStore: pid=805758 updating hbase:meta
row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING,
regionLocation=server,17020,1549400274239
...
2019-02-05 13:39:42,521 WARN [ProcExecTimeout] procedure2.ProcedureExecutor:
Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
{noformat}
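To make the failure mode concrete, here is a minimal, self-contained Java sketch (not HBase code; the pool size, class, and names below are made up for illustration) of the pattern: when every thread in a small fixed pool blocks in a synchronous retry loop against an unresponsive meta, nothing is left to run anything else.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch (not HBase code): a small fixed pool whose threads all
 * block in synchronous meta-update retries has nothing left for other requests.
 */
public class SyncRetryStarvation {
  public static void main(String[] args) throws Exception {
    // Stand-in for the PEWorker / RPC handler pool on the master.
    ExecutorService workers = Executors.newFixedThreadPool(2);

    // Stand-in for a sync meta update that keeps retrying against a dead meta server.
    Runnable syncMetaUpdate = () -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(1000); // "RPC attempt" that never succeeds
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    };

    // Every worker gets stuck retrying a meta update...
    workers.submit(syncMetaUpdate);
    workers.submit(syncMetaUpdate);

    // ...so this later task (think: a regionServerReport, or the meta SCP itself)
    // only sits in the queue and is never run.
    workers.submit(() -> System.out.println("processed regionServerReport"));

    TimeUnit.SECONDS.sleep(5);
    System.out.println("nothing printed above: the pool is wedged on sync retries");
    workers.shutdownNow();
  }
}
{code}
The toy pool stands in for the master's PEWorker and RPC handler pools, which appears to be what the stuck-worker warning above is showing.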
The master then starts dropping timed-out calls:
{noformat}
2019-02-05 13:39:45,877 WARN
[RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer:
Dropping timed out call: callId: 7 service: RegionServerStatusService
methodName: RegionServerReport size: 102 connection: ...:35743 deadline:
1549401663387 ...
RS:
2019-02-05 13:39:45,521 INFO [RS_OPEN_REGION-regionserver/..:17020-4]
regionserver.HRegionServer: Failed report transition server ...
org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ...,
too many items queued ?
{noformat}
I think this eventually causes RSes to kill themselves, which further increases
the load on the master.
I wonder if the meta retry should be async? That way other calls could still be
processed if the meta server is just slow, assuming meta cannot be reassigned,
as in this case. If it can be reassigned, it might make sense to move meta off
a slow server (as a separate work item).
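As a rough sketch of what an async retry could look like (this is not the actual AMv2/ProcedureExecutor API; the class, method names, and backoff numbers below are invented for illustration), a failed meta update schedules its own retry on a timer and returns the calling thread immediately instead of sleeping on it.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the "async meta retry" idea (not the real
 * AMv2/ProcedureExecutor API): a failed update schedules its own retry on a
 * timer and returns the calling thread immediately instead of sleeping on it.
 */
public class AsyncMetaRetrySketch {
  private static final ScheduledExecutorService RETRY_TIMER =
      Executors.newSingleThreadScheduledExecutor();

  /** Pretend meta update; always fails here, as it would against a dead meta server. */
  private static boolean tryUpdateMeta(String encodedRegionName) {
    return false;
  }

  /** Returns right away; the worker thread is never held across the backoff. */
  static CompletableFuture<Void> updateMetaAsync(String encodedRegionName, int attempt) {
    CompletableFuture<Void> result = new CompletableFuture<>();
    if (tryUpdateMeta(encodedRegionName)) {
      result.complete(null);
      return result;
    }
    // Capped exponential backoff: 100ms, 200ms, ... up to 30s between attempts.
    long backoffMs = Math.min(30_000L, 100L << Math.min(attempt, 8));
    RETRY_TIMER.schedule(
        () -> updateMetaAsync(encodedRegionName, attempt + 1).whenComplete((v, t) -> {
          if (t != null) {
            result.completeExceptionally(t);
          } else {
            result.complete(v);
          }
        }),
        backoffMs, TimeUnit.MILLISECONDS);
    return result;
  }

  public static void main(String[] args) throws Exception {
    updateMetaAsync("7130dac84857699b8cd0061298b6fe9c", 0);
    System.out.println("worker thread is free while the update keeps retrying in the background");
    TimeUnit.SECONDS.sleep(3);
    RETRY_TIMER.shutdownNow();
  }
}
{code}
The real change would presumably hook into the procedure framework's own suspend/wake machinery rather than a side timer; the point of the sketch is only that the worker thread is not held across the backoff.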
was:
Due to many threads sync-retrying to update meta for a really long time, the
master doesn't appear to have enough threads left to process requests.
The meta server died, but its SCP is not being processed; I'm not sure whether
that's because the threads are full or for some other reason (the ZK issue
we've seen earlier in our cluster?).
{noformat}
2019-02-05 13:20:39,225 INFO [KeepAlivePEWorker-32]
assignment.RegionStateStore: pid=805758 updating hbase:meta
row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING,
regionLocation=server,17020,1549400274239
...
2019-02-05 13:39:42,521 WARN [ProcExecTimeout] procedure2.ProcedureExecutor:
Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
{noformat}
The master then starts dropping timed-out calls:
{noformat}
2019-02-05 13:39:45,877 WARN
[RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer:
Dropping timed out call: callId: 7 service: RegionServerStatusService
methodName: RegionServerReport size: 102 connection: ...:35743 deadline:
1549401663387 ...
RS:
2019-02-05 13:39:45,521 INFO [RS_OPEN_REGION-regionserver/..:17020-4]
regionserver.HRegionServer: Failed report transition server ...
org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ...,
too many items queued ?
{noformat}
I think this eventually causes RSes to kill themselves, which further increases
the load on the master.
I wonder if the meta retry should be async? That way other calls could still be
processed.
> meta failure (or slow meta) can cause master to sort of deadlock
> ----------------------------------------------------------------
>
> Key: HBASE-21851
> URL: https://issues.apache.org/jira/browse/HBASE-21851
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> Due to many threads sync-retrying to update meta for a really long time, the
> master doesn't appear to have enough threads left to process requests.
> The meta server died, but its SCP is not being processed; I'm not sure whether
> that's because the threads are full or for some other reason (the ZK issue
> we've seen earlier in our cluster?).
> {noformat}
> 2019-02-05 13:20:39,225 INFO [KeepAlivePEWorker-32]
> assignment.RegionStateStore: pid=805758 updating hbase:meta
> row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING,
> regionLocation=server,17020,1549400274239
> ...
> 2019-02-05 13:39:42,521 WARN [ProcExecTimeout] procedure2.ProcedureExecutor:
> Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
> {noformat}
> The master then starts dropping timed-out calls:
> {noformat}
> 2019-02-05 13:39:45,877 WARN
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer:
> Dropping timed out call: callId: 7 service: RegionServerStatusService
> methodName: RegionServerReport size: 102 connection: ...:35743 deadline:
> 1549401663387 ...
> RS:
> 2019-02-05 13:39:45,521 INFO [RS_OPEN_REGION-regionserver/..:17020-4]
> regionserver.HRegionServer: Failed report transition server ...
> org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ...,
> too many items queued ?
> {noformat}
> I think this eventually causes RSes to kill themselves, which further
> increases the load on the master.
> I wonder if the meta retry should be async? That way other calls could still
> be processed if the meta server is just slow, assuming meta cannot be
> reassigned, as in this case. If it can be reassigned, it might make sense to
> move meta off a slow server (as a separate work item).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)