[
https://issues.apache.org/jira/browse/HBASE-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ECFuzz updated HBASE-27016:
---------------------------
Component/s: shutdown
> when “zookeeper.session.timeout” is set to a negative number, HMaster cannot
> stop normally
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-27016
> URL: https://issues.apache.org/jira/browse/HBASE-27016
> Project: HBase
> Issue Type: Bug
> Components: shutdown
> Affects Versions: 2.2.2
> Environment: HBase 2.2.2
> os.name=Linux
> os.arch=amd64
> os.version=5.4.0-72-generic
> java.version=1.8.0_211
> java.vendor=Oracle Corporation
> Reporter: ECFuzz
> Assignee: ECFuzz
> Priority: Major
>
> In hbase- default.xml
> {code:java}
> <property>
> <name>zookeeper.session.timeout</name>
> <value>90000</value>
> <description>ZooKeeper session timeout in milliseconds. It is used in two
> different ways.
> First, this value is used in the ZK client that HBase uses to connect
> to the ensemble.
> It is also used by HBase when it starts a ZK server and it is passed as
> the 'maxSessionTimeout'.
> See
> https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions.
> For example, if an HBase region server connects to a ZK ensemble that's
> also managed
> by HBase, then the session timeout will be the one specified by this
> configuration.
> But, a region server that connects to an ensemble managed with a
> different configuration
> will be subjected that ensemble's maxSessionTimeout. So, even though
> HBase might propose
> using 90 seconds, the ensemble can have a max timeout lower than this
> and it will take
> precedence. The current default maxSessionTimeout that ZK ships with is
> 40 seconds, which is lower than
> HBase's.
> </description>
> </property>{code}
> The impact caused by setting zookeeper.session.timeout to negative number is
> different in local mode and pseudo-distributed mode.
>
> After starting hbase in local mode, hmaster will replace the sessiontimeout
> configured in this start master() with the default value of 10000 when it is
> judged as local mode, so setting zookeeper.session.timeout to a negative
> number in hbase-site.xml will have no effect on the hbase startup process.
> {code:java}
> private int startMaster() {
> Configuration conf = getConf();
> TraceUtil.initTracer(conf);
> try {
> // If 'local', defer to LocalHBaseCluster instance. Starts master
> // and regionserver both in the one JVM.
> if (LocalHBaseCluster.isLocal(conf)) {
> ...
> int localZKClusterSessionTimeout;
> localZKClusterSessionTimeout =
> conf.getInt(HConstants.ZK_SESSION_TIMEOUT + ".localHBaseCluster", 10*1000);
> conf.setInt(HConstants.ZK_SESSION_TIMEOUT,
> localZKClusterSessionTimeout);
> LOG.info("Starting a zookeeper cluster");
> ...
> }
> {code}
> Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml
> to start hbase in local mode normally.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh
> running master, logging to
> /home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
> 25667 HMaster
> 21859 Jps{code}
> {code:java}
> 2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster]
> master.HMaster: Master has completed initialization 8.179sec{code}
> However, next, when the hbase shell as a client initiates a request for basic
> database operations to the started HBase, the zookeeper.session.timeout is
> negative causing a ConnectionLoss error and any requested operation is
> invalid as it recreates the conf to load the configuration.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
> Use "help" to get list of supported commands.
> Use "exit" to quit this interactive shell.
> For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
> Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12
> UTC 2019
> Took 0.0212 seconds
> hbase(main):001:0> status
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "status"'
> Took 34.2850 seconds
> hbase(main):002:0> create 'aaaaa','a11111','a222222'
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "create"'
> Took 33.2029 seconds{code}
> The log shows that a socket connection was received, but consistently throws
> a warning that no additional data could be read from the client session and
> that the client may have closed the socket.
>
> Next, use the official termination method . /stop-hbase.sh to shut down
> hbase, it will also recreate the conf and load the configuration when
> shutting down. zookeeper.session.timeout will be negative and the terminal
> will be stuck in an infinite wait, and eventually the process will have to be
> terminated by kill -9, which is the official termination method .
> /stop-hbase.sh to shut down hbase is not effective.
> {code:java}
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh
> stopping
> hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
> stopping stop-master-command with kill -9 25667{code}
> Using the official termination method . /stop-hbase.sh when shutting down
> hbase, the log shows that the stop request was received as well as the socket
> connection, but still no additional data can be read from the client session,
> probably with the warning that the client has closed the socket.
> {code:java}
> 2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
> 2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
> 2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
> 2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
> 2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
> 2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
> 2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
> 2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
> 2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
> ...
> 2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
> 2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344
> (no session established for client)
> 2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342
> (no session established for client)
> 2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829){code}
> Although, before finally using kill -9 to force a shutdown, the logs show
> that the reason for not being able to shut down properly may be because the
> zookeeper client session timed out and the client session did not receive a
> message from the server within 0ms and was closing the socket connection and
> trying to reconnect.
> {code:java}
> 2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client
> connection, connectString=localhost:2181 sessionTimeout=-1
> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
> 2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
> (unknown error)
> 2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 0ms for sessionid 0x0
> 2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 0ms for sessionid 0x0, closing socket connection and attempting reconnect
> 2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
> 2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
> java.lang.IllegalArgumentException: Invalid path string
> "/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name
> specified @10
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
> at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
> at
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
> 2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn:
> EventThread shut down for session: 0x0{code}
> It can be speculated that the process of shutting down hbase lacks a
> mechanism to determine and handle the impact of a negative
> zookeeper.session.timeout, resulting in the process of repeatedly closing the
> socket connection and trying to reconnect, without actually performing the
> operation of terminating hbase.
> And in local mode, the set value of zookeeper.session.timeout in
> hbase-site.xml is replaced with the default value in the startup phase, while
> the set value is used in the run phase and shutdown, and this configuration
> item is used inconsistently in different phases.
>
> In pseudo-distributed mode, when zookeeper.session.timeout is set to a
> negative number, in the startup hbase phase, unlike the local mode replaces
> zookeeper.session.timeout with the default value of 10000, which can be
> checked in time for the exception caused by the negative value of the field.
> The log shows that the connection is closed after several attempts to
> reconnect after the client and the collection establish a connection judgment
> timeout, and then the zookeeper client event thread is closed , further
> leading to the failure to create the regionserver and the master node exits.
> The phenomenon observed throughout the process is that at first the Hbase
> daemons can be started normally, but after a period of time only the process
> HQuorumPeer remains, and the function cannot be used normally.
>
> To summarize, for the effect of setting zookeeper.session.timeout to a
> negative number, hbase can be started normally in local mode, but no valid
> hbase database operations can be initiated, and using the official
> termination method . /stop-hbase.sh does not work to shut down hbase. In
> pseudo-distributed mode, hbase can be started normally but some of the
> daemons exit after a period of time and the functionality cannot be used
> normally.
> In both modes, there is a lack of judgment on the impact of negative
> zookeeper.session.timeout and the related handling mechanism is not perfect,
> and there is a lack of clear and effective information, which makes it
> impossible to locate and handle exceptions in time.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)