[jira] [Created] (HDFS-16408) Negative LeaseRecheckIntervalMs will let LeaseMonitor loop forever and print huge amount of log

Jingxuan Fu (Jira) Sun, 02 Jan 2022 23:23:04 -0800

Jingxuan Fu created HDFS-16408:
----------------------------------

             Summary: Negative LeaseRecheckIntervalMs will let LeaseMonitor 
loop forever and print huge amount of log
                 Key: HDFS-16408
                 URL: https://issues.apache.org/jira/browse/HDFS-16408
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.3.1, 3.1.3
            Reporter: Jingxuan Fu



The try/catch statement in the LeaseMonitor daemon (in LeaseManager.java) has 
a problem: when an unexpected exception is caught, the daemon simply prints a 
warning message and continues with the next loop iteration. 

An extreme case arises when a user accidentally sets the configuration item 
'dfs.namenode.lease-recheck-interval-ms' to a negative number. Because the 
value is read without a range check, 'fsnamesystem.getLeaseRecheckIntervalMs()' 
returns it unchanged and it is passed as the argument to Thread.sleep(). A 
negative argument causes Thread.sleep() to throw an IllegalArgumentException, 
which is caught by 'catch(Throwable e)', and a warning message is printed. 

This behavior repeats on every subsequent iteration, so a huge number of 
repetitive messages is written to the log file in a short period of time, 
quickly consuming disk space and affecting the operation of the system.

As shown below, the NameNode log grew to about 178 MB within one minute.

 
{code:java}
ll logs/
total 174456
drwxrwxr-x  2 hadoop hadoop      4096 Jan  3 15:13 ./
drwxr-xr-x 11 hadoop hadoop      4096 Jan  3 15:13 ../
-rw-rw-r--  1 hadoop hadoop     36342 Jan  3 15:14 hadoop-hadoop-datanode-ljq1.log
-rw-rw-r--  1 hadoop hadoop      1243 Jan  3 15:13 hadoop-hadoop-datanode-ljq1.out
-rw-rw-r--  1 hadoop hadoop 178545466 Jan  3 15:14 hadoop-hadoop-namenode-ljq1.log
-rw-rw-r--  1 hadoop hadoop       692 Jan  3 15:13 hadoop-hadoop-namenode-ljq1.out
-rw-rw-r--  1 hadoop hadoop     33201 Jan  3 15:14 hadoop-hadoop-secondarynamenode-ljq1.log
-rw-rw-r--  1 hadoop hadoop      3764 Jan  3 15:14 hadoop-hadoop-secondarynamenode-ljq1.out
-rw-rw-r--  1 hadoop hadoop         0 Jan  3 15:13 SecurityAuth-hadoop.audit
 
tail -n 15 logs/hadoop-hadoop-namenode-ljq1.log 
2022-01-03 15:14:46,032 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
java.lang.IllegalArgumentException: timeout value is negative
        at java.base/java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
        at java.base/java.lang.Thread.run(Thread.java:829)
{code}
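
The failure mode described above can be reproduced outside HDFS. The following 
is a minimal standalone sketch, not the actual LeaseManager code: the class 
name NegativeSleepLoopDemo and the method runLoop are hypothetical, the 
intervalMs parameter stands in for the value returned by 
getLeaseRecheckIntervalMs(), and the loop is bounded so that it terminates.

```java
/** Standalone reproduction sketch; names here are hypothetical. */
final class NegativeSleepLoopDemo {

    /**
     * Runs a monitor-style loop for at most maxIterations and returns how
     * many times the catch(Throwable) block fired.
     */
    static int runLoop(long intervalMs, int maxIterations) {
        int caught = 0;
        for (int i = 0; i < maxIterations; i++) {
            try {
                // Throws IllegalArgumentException when intervalMs is negative.
                Thread.sleep(intervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            } catch (Throwable t) {
                // A daemon that only logs here re-enters the loop at once,
                // so every iteration produces one more warning.
                caught++;
            }
        }
        return caught;
    }

    public static void main(String[] args) {
        System.out.println("warnings: " + runLoop(-1L, 1000));
    }
}
```

Because the exception is thrown before any sleeping happens, nothing throttles 
the loop, which would explain the log growth seen above.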
 

I think there are two potential solutions. 

The first is to move 'catch(Throwable e)' outside the loop body in the 
LeaseMonitor daemon. This would match the NameNodeResourceMonitor daemon, 
which ends the thread when an unexpected exception is caught. 
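
A rough sketch of this first option follows. It is not the LeaseManager or 
NameNodeResourceMonitor source; the class MonitorExitOnErrorSketch and its 
fields are hypothetical and only illustrate the shape of a loop whose 
catch(Throwable) sits outside the loop body.

```java
/** Hypothetical monitor whose catch block is outside the loop body. */
final class MonitorExitOnErrorSketch implements Runnable {
    private final long intervalMs;
    private volatile boolean shouldRun = true;
    private int iterations = 0;
    private Throwable failure;

    MonitorExitOnErrorSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    @Override
    public void run() {
        try {
            while (shouldRun) {
                iterations++;
                // The periodic work (e.g. checking leases) would go here.
                Thread.sleep(intervalMs);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        } catch (Throwable t) {
            // Reached at most once: the loop is not re-entered, so the
            // thread ends instead of logging the same error forever.
            failure = t;
        }
    }

    void stop() { shouldRun = false; }
    int iterations() { return iterations; }
    Throwable failure() { return failure; }
}
```

With a negative interval, run() returns after a single iteration and the 
error is recorded once. The trade-off is that the monitor stops working until 
the process is restarted, so the failure should be made clearly visible.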

The second is to use Preconditions.checkArgument() to check the range of the 
configuration item 'dfs.namenode.lease-recheck-interval-ms' when it is read, 
so that an invalid value cannot affect the subsequent operation of the program.
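
A minimal sketch of this second option follows. To stay self-contained it 
uses a plain if/throw in place of Guava's Preconditions.checkArgument(), which 
likewise throws IllegalArgumentException. The class LeaseRecheckIntervalCheck 
and the method validate are hypothetical, and rejecting zero as well as 
negative values is an assumption (a zero interval would make the loop spin 
without sleeping).

```java
/** Hypothetical range check applied when the configuration value is read. */
final class LeaseRecheckIntervalCheck {
    static final String KEY = "dfs.namenode.lease-recheck-interval-ms";

    /**
     * Returns configuredMs unchanged if it is positive; otherwise fails fast.
     * Assumption: zero is rejected too, not only negative values.
     */
    static long validate(long configuredMs) {
        if (configuredMs <= 0) {
            throw new IllegalArgumentException(
                KEY + " must be positive, but was " + configuredMs);
        }
        return configuredMs;
    }

    public static void main(String[] args) {
        System.out.println(validate(2000L));
        try {
            validate(-1L);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at the point where the value is read reports the misconfiguration 
once, at startup, rather than repeatedly from inside the monitor thread.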

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
