[
https://issues.apache.org/jira/browse/HBASE-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jingxuan Fu updated HBASE-27016:
--------------------------------
Description:
{{In hbase- default.xml}}
{{<property>}}
{{ <name>zookeeper.session.timeout</name>}}
{{ <value>90000</value>}}
{{ <description>ZooKeeper session timeout in milliseconds. It is used in two
different ways.}}
{{ First, this value is used in the ZK client that HBase uses to connect
to the ensemble.}}
{{ It is also used by HBase when it starts a ZK server and it is passed as
the 'maxSessionTimeout'.}}
{{ See
[https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions].}}
{{ For example, if an HBase region server connects to a ZK ensemble that's
also managed}}
{{ by HBase, then the session timeout will be the one specified by this
configuration.}}
{{ But, a region server that connects to an ensemble managed with a
different configuration}}
{{ will be subjected that ensemble's maxSessionTimeout. So, even though
HBase might propose}}
{{ using 90 seconds, the ensemble can have a max timeout lower than this
and it will take}}
{{ precedence. The current default maxSessionTimeout that ZK ships with is
40 seconds, which is lower than}}
{{ HBase's.}}
{{ </description>}}
{{</property>}}
The impact caused by setting zookeeper.session.timeout to negative number is
different in local mode and pseudo-distributed mode.
After starting hbase in local mode, hmaster will replace the sessiontimeout
configured in this start master() with the default value of 10000 when it is
judged as local mode, so setting zookeeper.session.timeout to a negative number
in hbase-site.xml will have no effect on the hbase startup process.
private int startMaster() {
Configuration conf = getConf();
TraceUtil.initTracer(conf);
try {
// If 'local', defer to LocalHBaseCluster instance. Starts master
// and regionserver both in the one JVM.
if (LocalHBaseCluster.isLocal(conf))
{ ... int localZKClusterSessionTimeout;
localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT +
".localHBaseCluster", 10*1000); conf.setInt(HConstants.ZK_SESSION_TIMEOUT,
localZKClusterSessionTimeout); LOG.info("Starting a zookeeper cluster");
... }
Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml
to start hbase in local mode normally.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
version 9.0 and will likely be removed in a future release.
running master, logging to
/home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
version 9.0 and will likely be removed in a future release.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
25667 HMaster
21859 Jps
2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster]
master.HMaster: Master has completed initialization 8.179sec
However, next, when the hbase shell as a client initiates a request for basic
database operations to the started HBase, the zookeeper.session.timeout is
negative causing a ConnectionLoss error and any requested operation is invalid
as it recreates the conf to load the configuration.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: [http://hbase.apache.org/2.0/book.html#shell]
Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12
UTC 2019
Took 0.0212 seconds
hbase(main):001:0> status
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "status"'
Took 34.2850 seconds
hbase(main):002:0> create 'aaaaa','a11111','a222222'
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "create"'
Took 33.2029 seconds
The log shows that a socket connection was received, but consistently throws a
warning that no additional data could be read from the client session and that
the client may have closed the socket.
Next, use the official termination method . /stop-hbase.sh to shut down hbase,
it will also recreate the conf and load the configuration when shutting down.
zookeeper.session.timeout will be negative and the terminal will be stuck in an
infinite wait, and eventually the process will have to be terminated by kill
-9, which is the official termination method . /stop-hbase.sh to shut down
hbase is not effective.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh
stopping
hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
stopping stop-master-command with kill -9 25667
Using the official termination method . /stop-hbase.sh when shutting down
hbase, the log shows that the stop request was received as well as the socket
connection, but still no additional data can be read from the client session,
probably with the warning that the client has closed the socket.
2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
...
2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 (no
session established for client)
2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 (no
session established for client)
2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
Although, before finally using kill -9 to force a shutdown, the logs show that
the reason for not being able to shut down properly may be because the
zookeeper client session timed out and the client session did not receive a
message from the server within 0ms and was closing the socket connection and
trying to reconnect.
2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client
connection, connectString=localhost:2181 sessionTimeout=-1
watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Opening socket connection to server
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown
error)
2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
0ms for sessionid 0x0
2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
0ms for sessionid 0x0, closing socket connection and attempting reconnect
2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
java.lang.IllegalArgumentException: Invalid path string
"/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name
specified @10
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
at
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn:
EventThread shut down for session: 0x0
It can be speculated that the process of shutting down hbase lacks a mechanism
to determine and handle the impact of a negative zookeeper.session.timeout,
resulting in the process of repeatedly closing the socket connection and trying
to reconnect, without actually performing the operation of terminating hbase.
And in local mode, the set value of zookeeper.session.timeout in hbase-site.xml
is replaced with the default value in the startup phase, while the set value is
used in the run phase and shutdown, and this configuration item is used
inconsistently in different phases.
In pseudo-distributed mode, when zookeeper.session.timeout is set to a negative
number, in the startup hbase phase, unlike the local mode replaces
zookeeper.session.timeout with the default value of 10000, which can be checked
in time for the exception caused by the negative value of the field. The log
shows that the connection is closed after several attempts to reconnect after
the client and the collection establish a connection judgment timeout, and then
the zookeeper client event thread is closed , further leading to the failure to
create the regionserver and the master node exits.
The phenomenon observed throughout the process is that at first the Hbase
daemons can be started normally, but after a period of time only the process
HQuorumPeer remains, and the function cannot be used normally.
To summarize, for the effect of setting zookeeper.session.timeout to a negative
number, hbase can be started normally in local mode, but no valid hbase
database operations can be initiated, and using the official termination method
. /stop-hbase.sh does not work to shut down hbase. In pseudo-distributed mode,
hbase can be started normally but some of the daemons exit after a period of
time and the functionality cannot be used normally.
In both modes, there is a lack of judgment on the impact of negative
zookeeper.session.timeout and the related handling mechanism is not perfect,
and there is a lack of clear and effective information, which makes it
impossible to locate and handle exceptions in time.
was:
In hbase- default.xml
<property>
<name>zookeeper.session.timeout</name>
<value>90000</value>
<description>ZooKeeper session timeout in milliseconds. It is used in two
different ways.
First, this value is used in the ZK client that HBase uses to connect
to the ensemble.
It is also used by HBase when it starts a ZK server and it is passed as
the 'maxSessionTimeout'.
See
https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions.
For example, if an HBase region server connects to a ZK ensemble that's
also managed
by HBase, then the session timeout will be the one specified by this
configuration.
But, a region server that connects to an ensemble managed with a
different configuration
will be subjected that ensemble's maxSessionTimeout. So, even though
HBase might propose
using 90 seconds, the ensemble can have a max timeout lower than this
and it will take
precedence. The current default maxSessionTimeout that ZK ships with is
40 seconds, which is lower than
HBase's.
</description>
</property>
The impact caused by setting zookeeper.session.timeout to negative number is
different in local mode and pseudo-distributed mode.
After starting hbase in local mode, hmaster will replace the sessiontimeout
configured in this start master() with the default value of 10000 when it is
judged as local mode, so setting zookeeper.session.timeout to a negative number
in hbase-site.xml will have no effect on the hbase startup process.
private int startMaster() {
Configuration conf = getConf();
TraceUtil.initTracer(conf);
try {
// If 'local', defer to LocalHBaseCluster instance. Starts master
// and regionserver both in the one JVM.
if (LocalHBaseCluster.isLocal(conf)) {
...
int localZKClusterSessionTimeout;
localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT
+ ".localHBaseCluster", 10*1000);
conf.setInt(HConstants.ZK_SESSION_TIMEOUT, localZKClusterSessionTimeout);
LOG.info("Starting a zookeeper cluster");
...
}
Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml
to start hbase in local mode normally.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
version 9.0 and will likely be removed in a future release.
running master, logging to
/home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
version 9.0 and will likely be removed in a future release.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
25667 HMaster
21859 Jps
2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster]
master.HMaster: Master has completed initialization 8.179sec
However, next, when the hbase shell as a client initiates a request for basic
database operations to the started HBase, the zookeeper.session.timeout is
negative causing a ConnectionLoss error and any requested operation is invalid
as it recreates the conf to load the configuration.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12
UTC 2019
Took 0.0212 seconds
hbase(main):001:0> status
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "status"'
Took 34.2850 seconds
hbase(main):002:0> create 'aaaaa','a11111','a222222'
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "create"'
Took 33.2029 seconds
The log shows that a socket connection was received, but consistently throws a
warning that no additional data could be read from the client session and that
the client may have closed the socket.
Next, use the official termination method . /stop-hbase.sh to shut down hbase,
it will also recreate the conf and load the configuration when shutting down.
zookeeper.session.timeout will be negative and the terminal will be stuck in an
infinite wait, and eventually the process will have to be terminated by kill
-9, which is the official termination method . /stop-hbase.sh to shut down
hbase is not effective.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh
stopping
hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
stopping stop-master-command with kill -9 25667
Using the official termination method . /stop-hbase.sh when shutting down
hbase, the log shows that the stop request was received as well as the socket
connection, but still no additional data can be read from the client session,
probably with the warning that the client has closed the socket.
2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
...
2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 (no
session established for client)
2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 (no
session established for client)
2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.base/java.lang.Thread.run(Thread.java:829)
Although, before finally using kill -9 to force a shutdown, the logs show that
the reason for not being able to shut down properly may be because the
zookeeper client session timed out and the client session did not receive a
message from the server within 0ms and was closing the socket connection and
trying to reconnect.
2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client
connection, connectString=localhost:2181 sessionTimeout=-1
watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Opening socket connection to server
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown
error)
2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
0ms for sessionid 0x0
2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
0ms for sessionid 0x0, closing socket connection and attempting reconnect
2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
java.lang.IllegalArgumentException: Invalid path string
"/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name
specified @10
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
at
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn:
EventThread shut down for session: 0x0
It can be speculated that the process of shutting down hbase lacks a mechanism
to determine and handle the impact of a negative zookeeper.session.timeout,
resulting in the process of repeatedly closing the socket connection and trying
to reconnect, without actually performing the operation of terminating hbase.
And in local mode, the set value of zookeeper.session.timeout in hbase-site.xml
is replaced with the default value in the startup phase, while the set value is
used in the run phase and shutdown, and this configuration item is used
inconsistently in different phases.
In pseudo-distributed mode, when zookeeper.session.timeout is set to a negative
number, in the startup hbase phase, unlike the local mode replaces
zookeeper.session.timeout with the default value of 10000, which can be checked
in time for the exception caused by the negative value of the field. The log
shows that the connection is closed after several attempts to reconnect after
the client and the collection establish a connection judgment timeout, and then
the zookeeper client event thread is closed , further leading to the failure to
create the regionserver and the master node exits.
The phenomenon observed throughout the process is that at first the Hbase
daemons can be started normally, but after a period of time only the process
HQuorumPeer remains, and the function cannot be used normally.
To summarize, for the effect of setting zookeeper.session.timeout to a negative
number, hbase can be started normally in local mode, but no valid hbase
database operations can be initiated, and using the official termination method
. /stop-hbase.sh does not work to shut down hbase. In pseudo-distributed mode,
hbase can be started normally but some of the daemons exit after a period of
time and the functionality cannot be used normally.
In both modes, there is a lack of judgment on the impact of negative
zookeeper.session.timeout and the related handling mechanism is not perfect,
and there is a lack of clear and effective information, which makes it
impossible to locate and handle exceptions in time.
> when “zookeeper.session.timeout” is set to a negative number, HMaster cannot
> stop normally
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-27016
> URL: https://issues.apache.org/jira/browse/HBASE-27016
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.2
> Environment: HBase 2.2.2
> os.name=Linux
> os.arch=amd64
> os.version=5.4.0-72-generic
> java.version=1.8.0_211
> java.vendor=Oracle Corporation
> Reporter: Jingxuan Fu
> Assignee: Jingxuan Fu
> Priority: Major
>
> {{In hbase- default.xml}}
> {{<property>}}
> {{ <name>zookeeper.session.timeout</name>}}
> {{ <value>90000</value>}}
> {{ <description>ZooKeeper session timeout in milliseconds. It is used in two
> different ways.}}
> {{ First, this value is used in the ZK client that HBase uses to connect
> to the ensemble.}}
> {{ It is also used by HBase when it starts a ZK server and it is passed
> as the 'maxSessionTimeout'.}}
> {{ See
> [https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions].}}
> {{ For example, if an HBase region server connects to a ZK ensemble
> that's also managed}}
> {{ by HBase, then the session timeout will be the one specified by this
> configuration.}}
> {{ But, a region server that connects to an ensemble managed with a
> different configuration}}
> {{ will be subjected that ensemble's maxSessionTimeout. So, even though
> HBase might propose}}
> {{ using 90 seconds, the ensemble can have a max timeout lower than this
> and it will take}}
> {{ precedence. The current default maxSessionTimeout that ZK ships with
> is 40 seconds, which is lower than}}
> {{ HBase's.}}
> {{ </description>}}
> {{</property>}}
> The impact caused by setting zookeeper.session.timeout to negative number is
> different in local mode and pseudo-distributed mode.
> After starting hbase in local mode, hmaster will replace the sessiontimeout
> configured in this start master() with the default value of 10000 when it is
> judged as local mode, so setting zookeeper.session.timeout to a negative
> number in hbase-site.xml will have no effect on the hbase startup process.
> private int startMaster() {
> Configuration conf = getConf();
> TraceUtil.initTracer(conf);
>
> try {
> // If 'local', defer to LocalHBaseCluster instance. Starts master
> // and regionserver both in the one JVM.
> if (LocalHBaseCluster.isLocal(conf))
> { ... int localZKClusterSessionTimeout;
> localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT +
> ".localHBaseCluster", 10*1000);
> conf.setInt(HConstants.ZK_SESSION_TIMEOUT, localZKClusterSessionTimeout);
> LOG.info("Starting a zookeeper cluster"); ... }
> Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml
> to start hbase in local mode normally.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
> version 9.0 and will likely be removed in a future release.
> running master, logging to
> /home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in
> version 9.0 and will likely be removed in a future release.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
> 25667 HMaster
> 21859 Jps
> 2022-05-09 11:16:57,571 INFO [master/ljq1:16000:becomeActiveMaster]
> master.HMaster: Master has completed initialization 8.179sec
> However, next, when the hbase shell as a client initiates a request for basic
> database operations to the started HBase, the zookeeper.session.timeout is
> negative causing a ConnectionLoss error and any requested operation is
> invalid as it recreates the conf to load the configuration.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
> Use "help" to get list of supported commands.
> Use "exit" to quit this interactive shell.
> For Reference, please visit: [http://hbase.apache.org/2.0/book.html#shell]
> Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12
> UTC 2019
> Took 0.0212 seconds
>
>
> hbase(main):001:0> status
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "status"'
> Took 34.2850 seconds
>
>
> hbase(main):002:0> create 'aaaaa','a11111','a222222'
> ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
> For usage try 'help "create"'
> Took 33.2029 seconds
> The log shows that a socket connection was received, but consistently throws
> a warning that no additional data could be read from the client session and
> that the client may have closed the socket.
> Next, use the official termination method . /stop-hbase.sh to shut down
> hbase, it will also recreate the conf and load the configuration when
> shutting down. zookeeper.session.timeout will be negative and the terminal
> will be stuck in an infinite wait, and eventually the process will have to be
> terminated by kill -9, which is the official termination method .
> /stop-hbase.sh to shut down hbase is not effective.
> hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh
> stopping
> hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force
> stopping stop-master-command with kill -9 25667
> Using the official termination method . /stop-hbase.sh when shutting down
> hbase, the log shows that the stop request was received as well as the socket
> connection, but still no additional data can be read from the client session,
> probably with the warning that the client has closed the socket.
> 2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
> 2022-05-09 14:35:11,588 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
> 2022-05-09 14:35:12,709 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
> 2022-05-09 14:35:13,810 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
> 2022-05-09 14:35:14,912 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
> 2022-05-09 14:35:16,013 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
> 2022-05-09 14:35:17,115 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
> 2022-05-09 14:35:18,217 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
> 2022-05-09 14:35:19,321 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
> ...
> 2022-05-09 14:36:18,846 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
> 2022-05-09 14:36:19,033 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,034 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344
> (no session established for client)
> 2022-05-09 14:36:19,034 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-05-09 14:36:19,035 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342
> (no session established for client)
> 2022-05-09 14:36:19,035 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181]
> server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x0, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Although, before finally using kill -9 to force a shutdown, the logs show
> that the reason for not being able to shut down properly may be because the
> zookeeper client session timed out and the client session did not receive a
> message from the server within 0ms and was closing the socket connection and
> trying to reconnect.
> 2022-05-09 14:55:13,486 INFO [main] zookeeper.ZooKeeper: Initiating client
> connection, connectString=localhost:2181 sessionTimeout=-1
> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
> 2022-05-09 14:55:13,589 INFO [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
> (unknown error)
> 2022-05-09 14:55:13,621 WARN [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 0ms for sessionid 0x0
> 2022-05-09 14:55:13,622 INFO [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 0ms for sessionid 0x0, closing socket connection and attempting reconnect
> 2022-05-09 14:55:13,737 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
> 2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
> java.lang.IllegalArgumentException: Invalid path string
> "/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name
> specified @10
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
> at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
> at
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
> 2022-05-09 14:55:13,747 INFO [main-EventThread] zookeeper.ClientCnxn:
> EventThread shut down for session: 0x0
> It can be speculated that the process of shutting down hbase lacks a
> mechanism to determine and handle the impact of a negative
> zookeeper.session.timeout, resulting in the process of repeatedly closing the
> socket connection and trying to reconnect, without actually performing the
> operation of terminating hbase.
> And in local mode, the set value of zookeeper.session.timeout in
> hbase-site.xml is replaced with the default value in the startup phase, while
> the set value is used in the run phase and shutdown, and this configuration
> item is used inconsistently in different phases.
>
> In pseudo-distributed mode, when zookeeper.session.timeout is set to a
> negative number, in the startup hbase phase, unlike the local mode replaces
> zookeeper.session.timeout with the default value of 10000, which can be
> checked in time for the exception caused by the negative value of the field.
> The log shows that the connection is closed after several attempts to
> reconnect after the client and the collection establish a connection judgment
> timeout, and then the zookeeper client event thread is closed , further
> leading to the failure to create the regionserver and the master node exits.
> The phenomenon observed throughout the process is that at first the Hbase
> daemons can be started normally, but after a period of time only the process
> HQuorumPeer remains, and the function cannot be used normally.
>
> To summarize, for the effect of setting zookeeper.session.timeout to a
> negative number, hbase can be started normally in local mode, but no valid
> hbase database operations can be initiated, and using the official
> termination method . /stop-hbase.sh does not work to shut down hbase. In
> pseudo-distributed mode, hbase can be started normally but some of the
> daemons exit after a period of time and the functionality cannot be used
> normally.
> In both modes, there is a lack of judgment on the impact of negative
> zookeeper.session.timeout and the related handling mechanism is not perfect,
> and there is a lack of clear and effective information, which makes it
> impossible to locate and handle exceptions in time.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)