[
https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769759#comment-16769759
]
Sergey Shelukhin commented on HBASE-14498:
------------------------------------------
This is actually a critical data loss issue, because if an RS is network-partitioned
from ZK, the master could see its znode expire and reassign its regions.
We've just had a ZK outage (not a network split) and some RSes kept running for hours
without a connection, still performing region operations.
+1 on the patch for now, I will commit Monday if no objections. nit: should use
HConstants.DEFAULT_ZK_SESSION_TIMEOUT, not 90000.
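To illustrate the nit, a minimal sketch of the lookup (the class and method here are
just illustrative, not the actual patch; it assumes the usual HConstants key/default
pair):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;

public class ZkSessionTimeoutSketch {
  static int zkSessionTimeout(Configuration conf) {
    // Preferred: pull the default from HConstants so it stays in one place.
    return conf.getInt(HConstants.ZK_SESSION_TIMEOUT,
        HConstants.DEFAULT_ZK_SESSION_TIMEOUT);
    // Not this: conf.getInt(HConstants.ZK_SESSION_TIMEOUT, 90000);
  }
}
{code}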
I think this area needs to be hardened; if the server disconnects just before a ZK
ping update that was somehow delayed for almost the whole session timeout, and then
waits 2/3rds of the timeout before aborting, the master has to wait the connection
timeout plus some additional time after the znode is gone before it can reassign
regions.
Also, I am not sure abort is good enough, because it also takes time and may do
cleanup that touches region directories and WALs. Ideally the server should kill -9
itself when it loses the lock, or something close to that.
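As a rough sketch of the "kill -9 itself" idea (assumed class name, not the actual RS
code): on session expiry, halt the JVM immediately so no further cleanup can touch
region directories or WALs:
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class HaltOnSessionLossWatcher implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      // Runtime.halt() skips shutdown hooks and finalizers -- the closest
      // in-process equivalent of kill -9.
      Runtime.getRuntime().halt(1);
    }
  }
}
{code}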
Also, I wonder if we should use Curator; e.g. use a lock, with the master trying to
steal every server's lock - a server is considered dead as soon as its lock is stolen.
Although I guess SUSPENDED still needs to be handled there in a similar way; at least
I hope Curator reports LOST when the connection is gone. It would also reduce HBase's
ZK code and benefit from Curator wisdom :)
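Roughly what the Curator side could look like (an assumption about the wiring, not an
existing HBase API; the pause hook is hypothetical):
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class ServerLivenessListener implements ConnectionStateListener {
  @Override
  public void stateChanged(CuratorFramework client, ConnectionState newState) {
    if (newState == ConnectionState.SUSPENDED) {
      // Connection dropped but the session may still be alive:
      // stop taking writes, like handling Disconnected today.
      // pauseRegionOperations();  // hypothetical hook
    } else if (newState == ConnectionState.LOST) {
      // Curator has decided the session is gone: die hard, as above.
      Runtime.getRuntime().halt(1);
    }
  }
}
{code}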
I'll file a JIRA or two after this patch.
> Master stuck in infinite loop when all Zookeeper servers are unreachable
> ------------------------------------------------------------------------
>
> Key: HBASE-14498
> URL: https://issues.apache.org/jira/browse/HBASE-14498
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 2.0.0
> Reporter: Y. SREENIVASULU REDDY
> Assignee: Pankaj Kumar
> Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HBASE-14498-V2.patch, HBASE-14498-V3.patch,
> HBASE-14498-V4.patch, HBASE-14498-V5.patch, HBASE-14498-V6.patch,
> HBASE-14498-V6.patch, HBASE-14498-addendum.patch,
> HBASE-14498-branch-1.2.patch, HBASE-14498-branch-1.3-V2.patch,
> HBASE-14498-branch-1.3.patch, HBASE-14498-branch-1.4.patch,
> HBASE-14498-branch-1.patch, HBASE-14498.007.patch, HBASE-14498.008.patch,
> HBASE-14498.master.001.patch, HBASE-14498.master.002.patch, HBASE-14498.patch
>
>
> We met a weird scenario in our production environment.
> In an HA cluster,
> > the Active Master (HM1) is not able to connect to any ZooKeeper server (due to
> > a network breakdown between the master machine and the ZooKeeper servers).
> {code}
> 2015-09-26 15:24:47,508 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 33463ms for sessionid 0x104576b8dda0002, closing socket connection and
> attempting reconnect
> 2015-09-26 15:24:47,877 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)]
> client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:49,879 WARN
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:49,879 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown
> error)
> 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown
> error)
> 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 30023ms for sessionid 0x2045762cc710006, closing socket connection and
> attempting reconnect
> 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000]
> zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper,
> quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)]
> client.FourLetterWordMain: connecting to ZK-Host 2181
> 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
> 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181.
> Will not attempt to authenticate using SASL (unknown error)
> {code}
> > Since HM1 was not able to connect to any ZooKeeper server, the session
> > timeout did not happen on the ZooKeeper server side and HM1 did not abort.
> > On ZooKeeper session timeout, the standby master (HM2) registered itself as
> > the active master.
> > HM2 keeps waiting for region servers to report in as part of active master
> > initialization.
> {noformat}
> 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting
> for region servers count to settle; currently checked in 0, slept for 0 ms,
> expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval
> of 1500 ms. |
> org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> ---
> ---
> 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting
> for region servers count to settle; currently checked in 0, slept for 483913
> ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms,
> interval of 1500 ms. |
> org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> {noformat}
> > At the other end, the region servers keep reporting to HM1 at a 3 second
> > interval. A region server retrieves the master location from ZooKeeper only
> > when it cannot connect to the master (ServiceException).
> As per the current design, region servers will not report to HM2 until HM1
> aborts, so HM2 will exit (via InitializationMonitor) and again wait for region
> servers in a loop.
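A schematic sketch of the reporting behavior described above (assumed names, not the
actual HRegionServer code): the region server keeps reporting to whichever master stub
it holds and only re-reads the master znode after a failed report, so while HM1 is
reachable it never notices HM2.
{code}
public class ReportLoopSketch {
  interface MasterStub { void regionServerReport() throws Exception; }

  MasterStub master;  // currently points at HM1

  void reportLoop() throws InterruptedException {
    while (true) {
      try {
        master.regionServerReport();           // keeps going to HM1
      } catch (Exception serviceException) {
        master = refreshMasterFromZooKeeper(); // looked up only on failure
      }
      Thread.sleep(3000);                      // ~3 second report interval
    }
  }

  MasterStub refreshMasterFromZooKeeper() {
    // hypothetical: re-read /hbase/master and build a new stub
    return master;
  }
}
{code}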
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)