[jira] [Updated] (HBASE-27016) when “zookeeper.session.timeout” is set to a negative number, HMaster cannot stop normally

Jingxuan Fu (Jira) Mon, 09 May 2022 02:02:09 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jingxuan Fu updated HBASE-27016:
--------------------------------
    Description: 
{{In hbase- default.xml}}
{{<property>}}
{{  <name>zookeeper.session.timeout</name>}}
{{  <value>90000</value>}}
{{  <description>ZooKeeper session timeout in milliseconds. It is used in two 
different ways.}}
{{      First, this value is used in the ZK client that HBase uses to connect 
to the ensemble.}}
{{      It is also used by HBase when it starts a ZK server and it is passed as 
the 'maxSessionTimeout'.}}
{{      See 
[https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions].}}
{{      For example, if an HBase region server connects to a ZK ensemble that's 
also managed}}
{{      by HBase, then the session timeout will be the one specified by this 
configuration.}}
{{      But, a region server that connects to an ensemble managed with a 
different configuration}}
{{      will be subjected that ensemble's maxSessionTimeout. So, even though 
HBase might propose}}
{{      using 90 seconds, the ensemble can have a max timeout lower than this 
and it will take}}
{{      precedence. The current default maxSessionTimeout that ZK ships with is 
40 seconds, which is lower than}}
{{      HBase's.}}
{{  </description>}}
{{</property>}}


The impact caused by setting zookeeper.session.timeout to negative number is 
different in local mode and pseudo-distributed mode.

After starting hbase in local mode, hmaster will replace the sessiontimeout 
configured in this start master() with the default value of 10000 when it is 
judged as local mode, so setting zookeeper.session.timeout to a negative number 
in hbase-site.xml will have no effect on the hbase startup process.

private int startMaster() {
Configuration conf = getConf();
TraceUtil.initTracer(conf);

try {
  // If 'local', defer to LocalHBaseCluster instance. Starts master
  // and regionserver both in the one JVM.
  if (LocalHBaseCluster.isLocal(conf))

{     ...     int localZKClusterSessionTimeout;     
localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT + 
".localHBaseCluster", 10*1000);     conf.setInt(HConstants.ZK_SESSION_TIMEOUT, 
localZKClusterSessionTimeout);     LOG.info("Starting a zookeeper cluster");    
 ...     }

Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml 
to start hbase in local mode normally.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh 
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
running master, logging to 
/home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
25667 HMaster
21859 Jps
2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster] 
master.HMaster: Master has completed initialization 8.179sec
However, next, when the hbase shell as a client initiates a request for basic 
database operations to the started HBase, the zookeeper.session.timeout is 
negative causing a ConnectionLoss error and any requested operation is invalid 
as it recreates the conf to load the configuration.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: [http://hbase.apache.org/2.0/book.html#shell]
Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12 
UTC 2019
Took 0.0212 seconds                                                             
                                                                                
                                                                        
hbase(main):001:0> status
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "status"'
Took 34.2850 seconds                                                            
                                                                                
                                                                        
hbase(main):002:0> create 'aaaaa','a11111','a222222'
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "create"'
Took 33.2029 seconds
The log shows that a socket connection was received, but consistently throws a 
warning that no additional data could be read from the client session and that 
the client may have closed the socket.

Next, use the official termination method . /stop-hbase.sh to shut down hbase, 
it will also recreate the conf and load the configuration when shutting down. 
zookeeper.session.timeout will be negative and the terminal will be stuck in an 
infinite wait, and eventually the process will have to be terminated by kill 
-9, which is the official termination method . /stop-hbase.sh to shut down 
hbase is not effective.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh 
stopping 
hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
 stopping stop-master-command with kill -9 25667
Using the official termination method . /stop-hbase.sh when shutting down 
hbase, the log shows that the stop request was received as well as the socket 
connection, but still no additional data can be read from the client session, 
probably with the warning that the client has closed the socket.
2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
...
2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 (no 
session established for client)
2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 (no 
session established for client)
2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
Although, before finally using kill -9 to force a shutdown, the logs show that 
the reason for not being able to shut down properly may be because the 
zookeeper client session timed out and the client session did not receive a 
message from the server within 0ms and was closing the socket connection and 
trying to reconnect.
2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client 
connection, connectString=localhost:2181 sessionTimeout=-1 
watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Opening socket connection to server 
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown 
error)
2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
0ms for sessionid 0x0
2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
0ms for sessionid 0x0, closing socket connection and attempting reconnect
2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
java.lang.IllegalArgumentException: Invalid path string 
"/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name 
specified @10
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn: 
EventThread shut down for session: 0x0
It can be speculated that the process of shutting down hbase lacks a mechanism 
to determine and handle the impact of a negative zookeeper.session.timeout, 
resulting in the process of repeatedly closing the socket connection and trying 
to reconnect, without actually performing the operation of terminating hbase.

And in local mode, the set value of zookeeper.session.timeout in hbase-site.xml 
is replaced with the default value in the startup phase, while the set value is 
used in the run phase and shutdown, and this configuration item is used 
inconsistently in different phases.

 

In pseudo-distributed mode, when zookeeper.session.timeout is set to a negative 
number, in the startup hbase phase, unlike the local mode replaces 
zookeeper.session.timeout with the default value of 10000, which can be checked 
in time for the exception caused by the negative value of the field. The log 
shows that the connection is closed after several attempts to reconnect after 
the client and the collection establish a connection judgment timeout, and then 
the zookeeper client event thread is closed , further leading to the failure to 
create the regionserver and the master node exits.

The phenomenon observed throughout the process is that at first the Hbase 
daemons can be started normally, but after a period of time only the process 
HQuorumPeer remains, and the function cannot be used normally.

 

To summarize, for the effect of setting zookeeper.session.timeout to a negative 
number, hbase can be started normally in local mode, but no valid hbase 
database operations can be initiated, and using the official termination method 
. /stop-hbase.sh does not work to shut down hbase. In pseudo-distributed mode, 
hbase can be started normally but some of the daemons exit after a period of 
time and the functionality cannot be used normally.

In both modes, there is a lack of judgment on the impact of negative 
zookeeper.session.timeout and the related handling mechanism is not perfect, 
and there is a lack of clear and effective information, which makes it 
impossible to locate and handle exceptions in time.

  was:
In hbase- default.xml
<property>
    <name>zookeeper.session.timeout</name>
    <value>90000</value>
    <description>ZooKeeper session timeout in milliseconds. It is used in two 
different ways.
        First, this value is used in the ZK client that HBase uses to connect 
to the ensemble.
        It is also used by HBase when it starts a ZK server and it is passed as 
the 'maxSessionTimeout'.
        See 
https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions.
        For example, if an HBase region server connects to a ZK ensemble that's 
also managed
        by HBase, then the session timeout will be the one specified by this 
configuration.
        But, a region server that connects to an ensemble managed with a 
different configuration
        will be subjected that ensemble's maxSessionTimeout. So, even though 
HBase might propose
        using 90 seconds, the ensemble can have a max timeout lower than this 
and it will take
        precedence. The current default maxSessionTimeout that ZK ships with is 
40 seconds, which is lower than
        HBase's.
    </description>
</property>
The impact caused by setting zookeeper.session.timeout to negative number is 
different in local mode and pseudo-distributed mode.

After starting hbase in local mode, hmaster will replace the sessiontimeout 
configured in this start master() with the default value of 10000 when it is 
judged as local mode, so setting zookeeper.session.timeout to a negative number 
in hbase-site.xml will have no effect on the hbase startup process.
private int startMaster() {
  Configuration conf = getConf();
  TraceUtil.initTracer(conf);

  try {
    // If 'local', defer to LocalHBaseCluster instance.  Starts master
    // and regionserver both in the one JVM.
    if (LocalHBaseCluster.isLocal(conf)) {
      ...
      int localZKClusterSessionTimeout;
      localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT 
+ ".localHBaseCluster", 10*1000);
      conf.setInt(HConstants.ZK_SESSION_TIMEOUT, localZKClusterSessionTimeout);
      LOG.info("Starting a zookeeper cluster");
      ...
      }
Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml 
to start hbase in local mode normally.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh 
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
running master, logging to 
/home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
25667 HMaster
21859 Jps
2022-05-09 11:16:57,571 INFO  [master/ljq1:16000:becomeActiveMaster] 
master.HMaster: Master has completed initialization 8.179sec
However, next, when the hbase shell as a client initiates a request for basic 
database operations to the started HBase, the zookeeper.session.timeout is 
negative causing a ConnectionLoss error and any requested operation is invalid 
as it recreates the conf to load the configuration.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12 
UTC 2019
Took 0.0212 seconds                                                             
                                                                                
                                                                        
hbase(main):001:0> status
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "status"'
Took 34.2850 seconds                                                            
                                                                                
                                                                        
hbase(main):002:0> create 'aaaaa','a11111','a222222'
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "create"'
Took 33.2029 seconds
The log shows that a socket connection was received, but consistently throws a 
warning that no additional data could be read from the client session and that 
the client may have closed the socket.

Next, use the official termination method . /stop-hbase.sh to shut down hbase, 
it will also recreate the conf and load the configuration when shutting down. 
zookeeper.session.timeout will be negative and the terminal will be stuck in an 
infinite wait, and eventually the process will have to be terminated by kill 
-9, which is the official termination method . /stop-hbase.sh to shut down 
hbase is not effective.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh 
stopping 
hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
 stopping stop-master-command with kill -9 25667
Using the official termination method . /stop-hbase.sh when shutting down 
hbase, the log shows that the stop request was received as well as the socket 
connection, but still no additional data can be read from the client session, 
probably with the warning that the client has closed the socket.
2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
2022-05-09 14:35:11,588 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
2022-05-09 14:35:12,709 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
2022-05-09 14:35:13,810 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
2022-05-09 14:35:14,912 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
2022-05-09 14:35:16,013 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
2022-05-09 14:35:17,115 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
2022-05-09 14:35:18,217 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
2022-05-09 14:35:19,321 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
...
2022-05-09 14:36:18,846 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
2022-05-09 14:36:19,033 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,034 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 (no 
session established for client)
2022-05-09 14:36:19,034 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,035 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 (no 
session established for client)
2022-05-09 14:36:19,035 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
Although, before finally using kill -9 to force a shutdown, the logs show that 
the reason for not being able to shut down properly may be because the 
zookeeper client session timed out and the client session did not receive a 
message from the server within 0ms and was closing the socket connection and 
trying to reconnect.
2022-05-09 14:55:13,486 INFO  [main] zookeeper.ZooKeeper: Initiating client 
connection, connectString=localhost:2181 sessionTimeout=-1 
watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
2022-05-09 14:55:13,589 INFO  [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Opening socket connection to server 
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown 
error)
2022-05-09 14:55:13,621 WARN  [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
0ms for sessionid 0x0
2022-05-09 14:55:13,622 INFO  [main-SendThread(localhost:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
0ms for sessionid 0x0, closing socket connection and attempting reconnect
2022-05-09 14:55:13,737 INFO  [main] zookeeper.ZooKeeper: Session: 0x0 closed
2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
java.lang.IllegalArgumentException: Invalid path string 
"/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name 
specified @10
  at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
  at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
  at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
  at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
  at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
  at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
  at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
  at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
2022-05-09 14:55:13,747 INFO  [main-EventThread] zookeeper.ClientCnxn: 
EventThread shut down for session: 0x0
It can be speculated that the process of shutting down hbase lacks a mechanism 
to determine and handle the impact of a negative zookeeper.session.timeout, 
resulting in the process of repeatedly closing the socket connection and trying 
to reconnect, without actually performing the operation of terminating hbase.

And in local mode, the set value of zookeeper.session.timeout in hbase-site.xml 
is replaced with the default value in the startup phase, while the set value is 
used in the run phase and shutdown, and this configuration item is used 
inconsistently in different phases.

 

In pseudo-distributed mode, when zookeeper.session.timeout is set to a negative 
number, in the startup hbase phase, unlike the local mode replaces 
zookeeper.session.timeout with the default value of 10000, which can be checked 
in time for the exception caused by the negative value of the field. The log 
shows that the connection is closed after several attempts to reconnect after 
the client and the collection establish a connection judgment timeout, and then 
the zookeeper client event thread is closed , further leading to the failure to 
create the regionserver and the master node exits.

The phenomenon observed throughout the process is that at first the Hbase 
daemons can be started normally, but after a period of time only the process 
HQuorumPeer remains, and the function cannot be used normally.

 

To summarize, for the effect of setting zookeeper.session.timeout to a negative 
number, hbase can be started normally in local mode, but no valid hbase 
database operations can be initiated, and using the official termination method 
. /stop-hbase.sh does not work to shut down hbase. In pseudo-distributed mode, 
hbase can be started normally but some of the daemons exit after a period of 
time and the functionality cannot be used normally.

In both modes, there is a lack of judgment on the impact of negative 
zookeeper.session.timeout and the related handling mechanism is not perfect, 
and there is a lack of clear and effective information, which makes it 
impossible to locate and handle exceptions in time.


> when “zookeeper.session.timeout” is set to a negative number, HMaster cannot 
> stop normally
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-27016
>                 URL: https://issues.apache.org/jira/browse/HBASE-27016
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.2
>         Environment: HBase 2.2.2
> os.name=Linux
> os.arch=amd64
> os.version=5.4.0-72-generic
> java.version=1.8.0_211
> java.vendor=Oracle Corporation
>            Reporter: Jingxuan Fu
>            Assignee: Jingxuan Fu
>            Priority: Major
>
> {{In hbase- default.xml}}
> {{<property>}}
> {{  <name>zookeeper.session.timeout</name>}}
> {{  <value>90000</value>}}
> {{  <description>ZooKeeper session timeout in milliseconds. It is used in two 
> different ways.}}
> {{      First, this value is used in the ZK client that HBase uses to connect 
> to the ensemble.}}
> {{      It is also used by HBase when it starts a ZK server and it is passed 
> as the 'maxSessionTimeout'.}}
> {{      See 
> [https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions].}}
> {{      For example, if an HBase region server connects to a ZK ensemble 
> that's also managed}}
> {{      by HBase, then the session timeout will be the one specified by this 
> configuration.}}
> {{      But, a region server that connects to an ensemble managed with a 
> different configuration}}
> {{      will be subjected that ensemble's maxSessionTimeout. So, even though 
> HBase might propose}}
> {{      using 90 seconds, the ensemble can have a max timeout lower than this 
> and it will take}}
> {{      precedence. The current default maxSessionTimeout that ZK ships with 
> is 40 seconds, which is lower than}}
> {{      HBase's.}}
> {{  </description>}}
> {{</property>}}
> The impact caused by setting zookeeper.session.timeout to negative number is 
> different in local mode and pseudo-distributed mode.
> After starting hbase in local mode, hmaster will replace the sessiontimeout 
> configured in this start master() with the default value of 10000 when it is 
> judged as local mode, so setting zookeeper.session.timeout to a negative 
> number in hbase-site.xml will have no effect on the hbase startup process.
> private int startMaster() {
> Configuration conf = getConf();
> TraceUtil.initTracer(conf);
> 
> try {
>   // If 'local', defer to LocalHBaseCluster instance. Starts master
>   // and regionserver both in the one JVM.
>   if (LocalHBaseCluster.isLocal(conf))
> {     ...     int localZKClusterSessionTimeout;     
> localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT + 
> ".localHBaseCluster", 10*1000);     
> conf.setInt(HConstants.ZK_SESSION_TIMEOUT, localZKClusterSessionTimeout);     
> LOG.info("Starting a zookeeper cluster");     ...     }
> Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml 
> to start hbase in local mode normally.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh 
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
> version 9.0 and will likely be removed in a future release.
> running master, logging to 
> /home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
> version 9.0 and will likely be removed in a future release.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
> 25667 HMaster
> 21859 Jps
> 2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster] 
> master.HMaster: Master has completed initialization 8.179sec
> However, next, when the hbase shell as a client initiates a request for basic 
> database operations to the started HBase, the zookeeper.session.timeout is 
> negative causing a ConnectionLoss error and any requested operation is 
> invalid as it recreates the conf to load the configuration.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
> Use "help" to get list of supported commands.
> Use "exit" to quit this interactive shell.
> For Reference, please visit: [http://hbase.apache.org/2.0/book.html#shell]
> Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12 
> UTC 2019
> Took 0.0212 seconds                                                           
>                                                                               
>                                                                             
> hbase(main):001:0> status
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "status"'
> Took 34.2850 seconds                                                          
>                                                                               
>                                                                             
> hbase(main):002:0> create 'aaaaa','a11111','a222222'
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "create"'
> Took 33.2029 seconds
> The log shows that a socket connection was received, but consistently throws 
> a warning that no additional data could be read from the client session and 
> that the client may have closed the socket.
> Next, use the official termination method . /stop-hbase.sh to shut down 
> hbase, it will also recreate the conf and load the configuration when 
> shutting down. zookeeper.session.timeout will be negative and the terminal 
> will be stuck in an infinite wait, and eventually the process will have to be 
> terminated by kill -9, which is the official termination method . 
> /stop-hbase.sh to shut down hbase is not effective.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh 
> stopping 
> hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
>  stopping stop-master-command with kill -9 25667
> Using the official termination method . /stop-hbase.sh when shutting down 
> hbase, the log shows that the stop request was received as well as the socket 
> connection, but still no additional data can be read from the client session, 
> probably with the warning that the client has closed the socket.
> 2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
> 2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
> 2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
> 2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
> 2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
> 2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
> 2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
> 2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
> 2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
> ...
> 2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
> 2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 
> (no session established for client)
> 2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 
> (no session established for client)
> 2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] 
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Although, before finally using kill -9 to force a shutdown, the logs show 
> that the reason for not being able to shut down properly may be because the 
> zookeeper client session timed out and the client session did not receive a 
> message from the server within 0ms and was closing the socket connection and 
> trying to reconnect.
> 2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client 
> connection, connectString=localhost:2181 sessionTimeout=-1 
> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
> 2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Opening socket connection to server 
> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL 
> (unknown error)
> 2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
> 0ms for sessionid 0x0
> 2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)] 
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
> 0ms for sessionid 0x0, closing socket connection and attempting reconnect
> 2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
> 2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
> java.lang.IllegalArgumentException: Invalid path string 
> "/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name 
> specified @10
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
> at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
> at 
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
> 2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn: 
> EventThread shut down for session: 0x0
> It can be speculated that the process of shutting down hbase lacks a 
> mechanism to determine and handle the impact of a negative 
> zookeeper.session.timeout, resulting in the process of repeatedly closing the 
> socket connection and trying to reconnect, without actually performing the 
> operation of terminating hbase.
> And in local mode, the set value of zookeeper.session.timeout in 
> hbase-site.xml is replaced with the default value in the startup phase, while 
> the set value is used in the run phase and shutdown, and this configuration 
> item is used inconsistently in different phases.
>  
> In pseudo-distributed mode, when zookeeper.session.timeout is set to a 
> negative number, in the startup hbase phase, unlike the local mode replaces 
> zookeeper.session.timeout with the default value of 10000, which can be 
> checked in time for the exception caused by the negative value of the field. 
> The log shows that the connection is closed after several attempts to 
> reconnect after the client and the collection establish a connection judgment 
> timeout, and then the zookeeper client event thread is closed , further 
> leading to the failure to create the regionserver and the master node exits.
> The phenomenon observed throughout the process is that at first the Hbase 
> daemons can be started normally, but after a period of time only the process 
> HQuorumPeer remains, and the function cannot be used normally.
>  
> To summarize, for the effect of setting zookeeper.session.timeout to a 
> negative number, hbase can be started normally in local mode, but no valid 
> hbase database operations can be initiated, and using the official 
> termination method . /stop-hbase.sh does not work to shut down hbase. In 
> pseudo-distributed mode, hbase can be started normally but some of the 
> daemons exit after a period of time and the functionality cannot be used 
> normally.
> In both modes, there is a lack of judgment on the impact of negative 
> zookeeper.session.timeout and the related handling mechanism is not perfect, 
> and there is a lack of clear and effective information, which makes it 
> impossible to locate and handle exceptions in time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Updated] (HBASE-27016) when “zookeeper.session.timeout” is set to a negative number, HMaster cannot stop normally

Reply via email to