[ https://issues.apache.org/jira/browse/HBASE-21851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HBASE-21851:
-------------------------------------
Affects Version/s: 3.0.0
> meta failure (or slow meta) can cause master to sortof deadlock
> ----------------------------------------------------------------
>
> Key: HBASE-21851
> URL: https://issues.apache.org/jira/browse/HBASE-21851
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Sergey Shelukhin
> Priority: Major
>
> Because many threads are synchronously retrying meta updates for a very long
> time, master doesn't appear to have enough threads left to process requests.
> The meta server died but its SCP is not processed. I'm not sure if that's
> because all the threads are busy, or for some other reason (the ZK issue we've
> seen earlier in our cluster?)
> {noformat}
> 2019-02-05 13:20:39,225 INFO [KeepAlivePEWorker-32]
> assignment.RegionStateStore: pid=805758 updating hbase:meta
> row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING,
> regionLocation=server,17020,1549400274239
>
>
> ...
> 2019-02-05 13:39:42,521 WARN [ProcExecTimeout] procedure2.ProcedureExecutor:
> Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
> {noformat}
> The master then starts dropping timed-out calls:
> {noformat}
> 2019-02-05 13:39:45,877 WARN
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer:
> Dropping timed out call: callId: 7 service: RegionServerStatusService
> methodName: RegionServerReport size: 102 connection: ...:35743 deadline:
> 1549401663387 ...
> RS:
> 2019-02-05 13:39:45,521 INFO [RS_OPEN_REGION-regionserver/..:17020-4]
> regionserver.HRegionServer: Failed report transition server ...
> org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ...,
> too many items queued ?
> {noformat}
> I think this eventually causes RSes to kill themselves, which further
> increases load on master.
> I wonder if meta retry should be async? That way other calls could be
> processed if the meta server is just slow, assuming meta cannot be reassigned,
> as in this case. If it can, it might make sense to move it if the updates are
> too slow for some time (in a separate work item).
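
To make the async-retry idea above concrete, here is a minimal sketch in plain Java. It is not HBase code: the class `AsyncMetaRetry`, the method `retryAsync`, and the simulated meta update are all hypothetical names chosen for illustration, and the backoff values are arbitrary. The point it tries to show is that a failed attempt schedules the next one on a timer and returns, so the worker thread is freed instead of sleeping between retries.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in HBase.
public class AsyncMetaRetry {

    /** Runs op; on failure the next attempt is scheduled on the timer, not slept on. */
    static <T> CompletableFuture<T> retryAsync(Supplier<CompletableFuture<T>> op,
            ScheduledExecutorService timer, int attemptsLeft, long delayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(op, timer, attemptsLeft, delayMs, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> op,
            ScheduledExecutorService timer, int attemptsLeft, long delayMs,
            CompletableFuture<T> result) {
        CompletableFuture<T> f;
        try {
            f = op.get();
        } catch (RuntimeException e) {
            // Treat a synchronous throw the same as a failed future.
            f = new CompletableFuture<>();
            f.completeExceptionally(e);
        }
        f.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptsLeft <= 1) {
                result.completeExceptionally(error); // out of attempts, give up
            } else {
                // Double the delay, capped at 10s (arbitrary values for the sketch).
                long nextDelay = Math.min(delayMs * 2, 10_000L);
                timer.schedule(
                    () -> attempt(op, timer, attemptsLeft - 1, nextDelay, result),
                    delayMs, TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        // Simulated meta update that fails twice before succeeding.
        Supplier<CompletableFuture<String>> update = () -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            if (calls.incrementAndGet() < 3) {
                f.completeExceptionally(new RuntimeException("meta unavailable"));
            } else {
                f.complete("updated");
            }
            return f;
        };
        String outcome = retryAsync(update, timer, 5, 50).join();
        System.out.println(outcome + " after " + calls.get() + " attempts");
        timer.shutdownNow();
    }
}
```

In a design like this, the caller gets a future back immediately and the thread that issued the update can go on to handle other requests; whether that maps cleanly onto the procedure executor's workers is a separate question this sketch does not answer.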