[ https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749312#comment-16749312 ]

Sergey Shelukhin commented on HBASE-21576:
------------------------------------------

I filed a separate bug somewhere: an RS aborting due to a 
DroppedSnapshotException actually does try to close its regions, and that is 
the part that took a long time. HBASE-21577 should address this at some point.
We've also seen an issue once where, for whatever reason, the master didn't 
detect via the ZK node that an RS had died until some other RS also died (was 
the ZK notification lost somehow?)... I filed HBASE-21744 to mitigate that.

One thing I found since then is that the master's "aborting RS" message is 
actually purely informational: the RS sends a message saying it's going to 
die, and the master merely logs it. So this issue is not really relevant, 
because the master would indeed have to wait for SCP (ServerCrashProcedure) to 
do recovery (I was assuming the master could delay the death of the RS, move 
meta first, and then let the RS proceed).
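
To illustrate why the message is informational only, here is my simplified 
understanding of the flow, with hypothetical names (not the actual HBase 
classes): the report is a one-way notification, the master just logs it, and 
recovery happens later via SCP once the server is marked dead.

{code:java}
// Illustrative only -- hypothetical names, not the actual HBase classes.
// The RS reports its fatal error and then proceeds with its own shutdown;
// the master's "ABORTING region server" line is just this report being
// logged. Actual recovery starts later, once SCP runs for the dead server.
class DyingRegionServer {
  interface MasterStub { // stand-in for the RS->master RPC stub
    void reportFatalError(String serverName, String reason);
  }

  private final MasterStub master;
  private final String serverName;

  DyingRegionServer(MasterStub master, String serverName) {
    this.master = master;
    this.serverName = serverName;
  }

  void abort(String reason) {
    try {
      // One-way notification: the master cannot veto or delay the abort.
      master.reportFatalError(serverName, reason);
    } catch (RuntimeException e) {
      // Best effort; the RS dies whether or not the report lands.
    }
    closeRegionsAndShutdown(); // the part that can take many minutes
  }

  void closeRegionsAndShutdown() { /* flush/close regions, stop services */ }
}
{code}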

> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
>                 Key: HBASE-21576
>                 URL: https://issues.apache.org/jira/browse/HBASE-21576
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> The master killed an RS that was hosting meta, most likely due to some HDFS 
> issue (I've lost the RS logs due to HBASE-21575).
> The RS took a very long time to die (again, that might be a separate bug; 
> I'll file one if I see a repro) and a long time to restart; meanwhile, the 
> master never tried to reassign meta and eventually killed itself when it 
> could no longer update it.
> It seems like an RS on a bad machine would be especially prone to a slow 
> abort/startup, as well as to issues causing the master to kill it, so it 
> would make sense for the master to relocate meta immediately once the 
> meta-hosting RS is dead after a kill, or even while killing the RS (a rough 
> sketch of this follows the quoted log below). In the former case (if the RS 
> needs to die before meta can be reassigned safely), perhaps an RS hosting 
> meta in particular should try to die fast in such circumstances and skip any 
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] 
> master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal 
> error:
> ***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL 
> required. Forcing server shutdown *****
> .... [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call 
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> <server1>,17020,1544264858183 aborting, details=row '...' on table 
> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=<server1>,17020,1544264858183, seqNum=-1
> ... [starting for ~5 minutes]
> 2018-12-08 04:59:58,574 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] 
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, 
> started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on 
> connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: 
> connection timed out: <server1>, details=row '...' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, 
> seqNum=-1
> ... [re-initializing for at least ~7 minutes]
> 2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] 
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, 
> started=41137 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
> <server1>,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: 
> ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... 
> state=OPEN *****
> {noformat}
> There are no signs of meta assignment activity at all in the master logs.
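
To make the proposal above concrete, a rough sketch of what proactive handling 
could look like when the master expires a meta-hosting server (hypothetical 
names throughout; the real change would live in the master's expiry/SCP path):

{code:java}
// Hypothetical sketch of the proposal above, not actual HBase code: when the
// master expires (or decides to kill) a server, check whether it carries
// hbase:meta and schedule the meta reassignment up front instead of letting
// it wait behind a potentially very slow RS abort.
class ServerExpirer {
  interface Assignment { // stand-in for the master's assignment machinery
    boolean isCarryingMeta(String serverName);
    void reassignMeta();
    void scheduleServerCrashProcedure(String serverName);
  }

  private final Assignment assignment;

  ServerExpirer(Assignment assignment) {
    this.assignment = assignment;
  }

  void expireServer(String serverName) {
    if (assignment.isCarryingMeta(serverName)) {
      // Move meta right away so the master can keep persisting region state
      // while the dying RS works through its (possibly slow) cleanup.
      assignment.reassignMeta();
    }
    assignment.scheduleServerCrashProcedure(serverName);
  }
}
{code}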



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
