[ 
https://issues.apache.org/jira/browse/HBASE-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ECFuzz updated HBASE-27016:
---------------------------
    Component/s: master
                     (was: shutdown)

> when “zookeeper.session.timeout” is set to a negative number, HMaster cannot 
> stop normally
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-27016
>                 URL: https://issues.apache.org/jira/browse/HBASE-27016
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.2.2
>         Environment: HBase 2.2.2
> os.name=Linux
> os.arch=amd64
> os.version=5.4.0-72-generic
> java.version=1.8.0_211
> java.vendor=Oracle Corporation
>            Reporter: ECFuzz
>            Assignee: ECFuzz
>            Priority: Major
>
> In hbase- default.xml
> {code:java}
> <property>
>   <name>zookeeper.session.timeout</name>
>   <value>90000</value>
>   <description>ZooKeeper session timeout in milliseconds. It is used in two 
> different ways.
>       First, this value is used in the ZK client that HBase uses to connect 
> to the ensemble.
>       It is also used by HBase when it starts a ZK server and it is passed as 
> the 'maxSessionTimeout'.
>       See 
> https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions.
>       For example, if an HBase region server connects to a ZK ensemble that's 
> also managed
>       by HBase, then the session timeout will be the one specified by this 
> configuration.
>       But, a region server that connects to an ensemble managed with a 
> different configuration
>       will be subjected that ensemble's maxSessionTimeout. So, even though 
> HBase might propose
>       using 90 seconds, the ensemble can have a max timeout lower than this 
> and it will take
>       precedence. The current default maxSessionTimeout that ZK ships with is 
> 40 seconds, which is lower than
>       HBase's.
>   </description>
> </property>{code}
> The impact caused by setting zookeeper.session.timeout to negative number is 
> different in local mode and pseudo-distributed mode.
>  
> After starting hbase in local mode, hmaster will replace the sessiontimeout 
> configured in this start master() with the default value of 10000 when it is 
> judged as local mode, so setting zookeeper.session.timeout to a negative 
> number in hbase-site.xml will have no effect on the hbase startup process.
> {code:java}
> private int startMaster() {
>   Configuration conf = getConf();
>   TraceUtil.initTracer(conf);
>   try {
>     // If 'local', defer to LocalHBaseCluster instance.  Starts master
>     // and regionserver both in the one JVM.
>     if (LocalHBaseCluster.isLocal(conf)) {
>       ...
>       int localZKClusterSessionTimeout;
>       localZKClusterSessionTimeout = 
> conf.getInt(HConstants.ZK_SESSION_TIMEOUT + ".localHBaseCluster", 10*1000);
>       conf.setInt(HConstants.ZK_SESSION_TIMEOUT, 
> localZKClusterSessionTimeout);
>       LOG.info("Starting a zookeeper cluster");
>       ...
>       }
> {code}
> Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml 
> to start hbase in local mode normally.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh 
> running master, logging to 
> /home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
> 25667 HMaster
> 21859 Jps{code}
> {code:java}
> 2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster] 
> master.HMaster: Master has completed initialization 8.179sec{code}
> However, next, when the hbase shell as a client initiates a request for basic 
> database operations to the started HBase, the zookeeper.session.timeout is 
> negative causing a ConnectionLoss error and any requested operation is 
> invalid as it recreates the conf to load the configuration.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
> Use "help" to get list of supported commands.
> Use "exit" to quit this interactive shell.
> For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
> Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12 
> UTC 2019
> Took 0.0212 seconds                                                           
>    hbase(main):001:0> status
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "status"'
> Took 34.2850 seconds                                                          
>   hbase(main):002:0> create 'aaaaa','a11111','a222222'
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "create"'
> Took 33.2029 seconds{code}
> The log shows that a socket connection was received, but consistently throws 
> a warning that no additional data could be read from the client session and 
> that the client may have closed the socket.
>  
> Next, use the official termination method . /stop-hbase.sh to shut down 
> hbase, it will also recreate the conf and load the configuration when 
> shutting down. zookeeper.session.timeout will be negative and the terminal 
> will be stuck in an infinite wait, and eventually the process will have to be 
> terminated by kill -9, which is the official termination method . 
> /stop-hbase.sh to shut down hbase is not effective.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh 
> stopping 
> hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
>  stopping stop-master-command with kill -9 25667{code}
> Using the official termination method . /stop-hbase.sh when shutting down 
> hbase, the log shows that the stop request was received as well as the socket 
> connection, but still no additional data can be read from the client session, 
> probably with the warning that the client has closed the socket.
> {code:java}
> 2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
> 2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
> 2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
> 2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
> 2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
> 2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
> 2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
> 2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
> 2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
> ...
> 2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
> 2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 
> (no session established for client)
> 2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 
> (no session established for client)
> 2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829){code}
> Although, before finally using kill -9 to force a shutdown, the logs show 
> that the reason for not being able to shut down properly may be because the 
> zookeeper client session timed out and the client session did not receive a 
> message from the server within 0ms and was closing the socket connection and 
> trying to reconnect.
> {code:java}
> 2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client 
> connection, connectString=localhost:2181 sessionTimeout=-1 
> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
> 2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Opening socket connection to server 
> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL 
> (unknown error)
> 2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
> 0ms for sessionid 0x0
> 2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
> 0ms for sessionid 0x0, closing socket connection and attempting reconnect
> 2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
> 2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
> java.lang.IllegalArgumentException: Invalid path string 
> "/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name 
> specified @10
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
> at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
> at 
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
> 2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn: 
> EventThread shut down for session: 0x0{code}
> It can be speculated that the process of shutting down hbase lacks a 
> mechanism to determine and handle the impact of a negative 
> zookeeper.session.timeout, resulting in the process of repeatedly closing the 
> socket connection and trying to reconnect, without actually performing the 
> operation of terminating hbase.
> And in local mode, the set value of zookeeper.session.timeout in 
> hbase-site.xml is replaced with the default value in the startup phase, while 
> the set value is used in the run phase and shutdown, and this configuration 
> item is used inconsistently in different phases.
>  
> In pseudo-distributed mode, when zookeeper.session.timeout is set to a 
> negative number, in the startup hbase phase, unlike the local mode replaces 
> zookeeper.session.timeout with the default value of 10000, which can be 
> checked in time for the exception caused by the negative value of the field. 
> The log shows that the connection is closed after several attempts to 
> reconnect after the client and the collection establish a connection judgment 
> timeout, and then the zookeeper client event thread is closed , further 
> leading to the failure to create the regionserver and the master node exits.
> The phenomenon observed throughout the process is that at first the Hbase 
> daemons can be started normally, but after a period of time only the process 
> HQuorumPeer remains, and the function cannot be used normally.
>  
> To summarize, for the effect of setting zookeeper.session.timeout to a 
> negative number, hbase can be started normally in local mode, but no valid 
> hbase database operations can be initiated, and using the official 
> termination method . /stop-hbase.sh does not work to shut down hbase. In 
> pseudo-distributed mode, hbase can be started normally but some of the 
> daemons exit after a period of time and the functionality cannot be used 
> normally.
> In both modes, there is a lack of judgment on the impact of negative 
> zookeeper.session.timeout and the related handling mechanism is not perfect, 
> and there is a lack of clear and effective information, which makes it 
> impossible to locate and handle exceptions in time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to