Jingxuan Fu created HDFS-16408:
----------------------------------
Summary: Negative LeaseRecheckIntervalMs will let LeaseMonitor
loop forever and print huge amount of log
Key: HDFS-16408
URL: https://issues.apache.org/jira/browse/HDFS-16408
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.3.1, 3.1.3
Reporter: Jingxuan Fu
There is a problem with the try/catch statement in the LeaseMonitor daemon (in
LeaseManager.java): when an unexpected exception is caught, it simply prints a
warning message and continues with the next loop iteration.
An extreme case arises when the configuration item
'dfs.namenode.lease-recheck-interval-ms' is accidentally set to a negative
number. Because the value is read without a range check,
'fsnamesystem.getLeaseRecheckIntervalMs()' returns it unchanged, and it is
passed as the argument to Thread.sleep(). A negative argument causes
Thread.sleep() to throw an IllegalArgumentException, which is caught by
'catch(Throwable e)', and a warning message is printed.
This behavior repeats on every subsequent iteration, and because the sleep
never takes place there is no delay between iterations. A huge number of
repetitive messages is therefore written to the log file in a short period of
time, quickly consuming disk space and affecting the operation of the system.
As you can see, the NameNode log grows to about 178 MB (178545466 bytes)
within one minute.
{code:java}
ll logs/
total 174456
drwxrwxr-x 2 hadoop hadoop 4096 Jan 3 15:13 ./
drwxr-xr-x 11 hadoop hadoop 4096 Jan 3 15:13 ../
-rw-rw-r-- 1 hadoop hadoop 36342 Jan 3 15:14 hadoop-hadoop-datanode-ljq1.log
-rw-rw-r-- 1 hadoop hadoop 1243 Jan 3 15:13 hadoop-hadoop-datanode-ljq1.out
-rw-rw-r-- 1 hadoop hadoop 178545466 Jan 3 15:14 hadoop-hadoop-namenode-ljq1.log
-rw-rw-r-- 1 hadoop hadoop 692 Jan 3 15:13 hadoop-hadoop-namenode-ljq1.out
-rw-rw-r-- 1 hadoop hadoop 33201 Jan 3 15:14 hadoop-hadoop-secondarynamenode-ljq1.log
-rw-rw-r-- 1 hadoop hadoop 3764 Jan 3 15:14 hadoop-hadoop-secondarynamenode-ljq1.out
-rw-rw-r-- 1 hadoop hadoop 0 Jan 3 15:13 SecurityAuth-hadoop.audit
tail -n 15 logs/hadoop-hadoop-namenode-ljq1.log
2022-01-03 15:14:46,032 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
{code}
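The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the LeaseManager code: the class name NegativeSleepLoopDemo and the method runLoop are hypothetical, the loop is bounded so that it terminates, and a counter stands in for the warning that the real daemon logs.

```java
// Hypothetical stand-alone sketch; this is not the actual LeaseManager code.
final class NegativeSleepLoopDemo {

    /**
     * Mimics the loop shape described in the report: Thread.sleep() sits inside
     * a try block whose catch(Throwable) only records a warning and lets the
     * loop continue. Returns how many times the catch block ran.
     */
    static int runLoop(long recheckIntervalMs, int maxIterations) {
        int warnings = 0;
        for (int i = 0; i < maxIterations; i++) {
            try {
                // A negative argument makes Thread.sleep() throw
                // IllegalArgumentException immediately, so no delay happens.
                Thread.sleep(recheckIntervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            } catch (Throwable t) {
                // Stands in for the warning that the real daemon logs.
                warnings++;
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Every iteration fails and is "logged": prints 1000.
        System.out.println(runLoop(-1L, 1000));
    }
}
```

With an unbounded loop, as in a daemon thread, the same pattern produces one warning per iteration with nothing to slow it down.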
I think there are two potential solutions.
The first is to restructure the try/catch statement in the LeaseMonitor daemon
by moving 'catch(Throwable e)' outside the loop body. This would follow the
NameNodeResourceMonitor daemon, which ends the thread when an unexpected
exception is caught.
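A rough sketch of this first option, assuming a simplified monitor loop: the names MonitorExitOnErrorDemo and runMonitor are hypothetical, and the real daemon's lease-checking work is reduced to an iteration counter.

```java
// Hypothetical sketch of a monitor loop with the catch moved outside the loop.
final class MonitorExitOnErrorDemo {

    /**
     * Runs up to maxIterations of a sleep-based loop. Because the catch blocks
     * wrap the whole loop, the first unexpected throwable ends the loop instead
     * of being logged again on every iteration. Returns the iterations started.
     */
    static int runMonitor(long recheckIntervalMs, int maxIterations) {
        int iterations = 0;
        try {
            while (iterations < maxIterations) {
                iterations++;
                Thread.sleep(recheckIntervalMs);
            }
        } catch (InterruptedException ie) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        } catch (Throwable t) {
            // Reported once; the monitor then stops.
            System.err.println("Monitor exiting on unexpected throwable: " + t);
        }
        return iterations;
    }

    public static void main(String[] args) {
        // The negative interval fails on the first iteration: prints 1.
        System.out.println(runMonitor(-1L, 1000));
    }
}
```

The trade-off is that the monitor stops doing its work after the first unexpected error, so this option exchanges log flooding for a silently stopped thread unless the exit is surfaced in some other way.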
The second is to use Preconditions.checkArgument() to validate the range of
'dfs.namenode.lease-recheck-interval-ms' when it is read, so that an invalid
value cannot affect the subsequent operation of the program.
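A sketch of this second option in plain Java, so that it runs without Guava on the classpath. The explicit check below throws the same exception type as Guava's Preconditions.checkArgument(boolean, Object). The class name LeaseRecheckIntervalCheck, the method validate, and the choice of a strictly positive range are assumptions made for illustration; the report itself does not state the intended bounds.

```java
// Hypothetical validation helper; names and the accepted range are assumptions.
final class LeaseRecheckIntervalCheck {

    static final String KEY = "dfs.namenode.lease-recheck-interval-ms";

    /**
     * Returns the value unchanged if it is strictly positive; otherwise fails
     * fast with a message naming the configuration key.
     */
    static long validate(long valueMs) {
        if (valueMs <= 0) {
            throw new IllegalArgumentException(
                KEY + " must be greater than zero, but was " + valueMs);
        }
        return valueMs;
    }

    public static void main(String[] args) {
        System.out.println(validate(2000L)); // prints 2000
        try {
            validate(-1L);
        } catch (IllegalArgumentException e) {
            // The bad value is rejected once, when it is read.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing when the value is read reports the misconfiguration once, with the key name in the message, instead of on every loop iteration.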