[
https://issues.apache.org/jira/browse/HDFS-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDFS-16408:
----------------------------------
Labels: pull-request-available (was: )
> Negative LeaseRecheckIntervalMs will let LeaseMonitor loop forever and print
> huge amount of log
> -----------------------------------------------------------------------------------------------
>
> Key: HDFS-16408
> URL: https://issues.apache.org/jira/browse/HDFS-16408
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.3, 3.3.1
> Reporter: Jingxuan Fu
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 1h
> Time Spent: 10m
> Remaining Estimate: 50m
>
> The try/catch statement in the LeaseMonitor daemon (in LeaseManager.java) sits
> inside the loop body: when an unexpected exception is caught, it only prints a
> warning message and continues with the next iteration.
> An extreme case is when the configuration item
> 'dfs.namenode.lease-recheck-interval-ms' is accidentally set to a negative
> number. The value is read without a range check, so
> 'fsnamesystem.getLeaseRecheckIntervalMs()' returns it unchanged and it is
> passed as the argument to Thread.sleep(). A negative argument causes
> Thread.sleep() to throw an IllegalArgumentException, which is caught by
> 'catch(Throwable e)', and a warning message is printed.
> This repeats on every subsequent iteration, with no delay between iterations
> because the sleep itself fails. A huge number of identical messages is
> therefore written to the log file in a short period of time, quickly consuming
> disk space and affecting the operation of the system.
> In the listing below, the NameNode log grew to about 178 MB in one minute.
>
> {code:java}
> ll logs/
> total 174456
> drwxrwxr-x 2 hadoop hadoop 4096 Jan 3 15:13 ./
> drwxr-xr-x 11 hadoop hadoop 4096 Jan 3 15:13 ../
> -rw-rw-r-- 1 hadoop hadoop 36342 Jan 3 15:14 hadoop-hadoop-datanode-ljq1.log
> -rw-rw-r-- 1 hadoop hadoop 1243 Jan 3 15:13 hadoop-hadoop-datanode-ljq1.out
> -rw-rw-r-- 1 hadoop hadoop 178545466 Jan 3 15:14 hadoop-hadoop-namenode-ljq1.log
> -rw-rw-r-- 1 hadoop hadoop 692 Jan 3 15:13 hadoop-hadoop-namenode-ljq1.out
> -rw-rw-r-- 1 hadoop hadoop 33201 Jan 3 15:14 hadoop-hadoop-secondarynamenode-ljq1.log
> -rw-rw-r-- 1 hadoop hadoop 3764 Jan 3 15:14 hadoop-hadoop-secondarynamenode-ljq1.out
> -rw-rw-r-- 1 hadoop hadoop 0 Jan 3 15:13 SecurityAuth-hadoop.audit
>
> tail -n 15 logs/hadoop-hadoop-namenode-ljq1.log
> 2022-01-03 15:14:46,032 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
> java.lang.IllegalArgumentException: timeout value is negative
>     at java.base/java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
> java.lang.IllegalArgumentException: timeout value is negative
>     at java.base/java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-01-03 15:14:46,033 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
> java.lang.IllegalArgumentException: timeout value is negative
>     at java.base/java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>
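The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual LeaseManager code: the class name `NegativeSleepLoopDemo` and the method `runLoop` are hypothetical, and the loop is bounded so that the example terminates. It only assumes what the report states, namely that the sleep and the `catch (Throwable)` both sit inside the loop body.

```java
/** Hypothetical stand-in for the monitor loop; not the actual LeaseManager code. */
final class NegativeSleepLoopDemo {

  /**
   * Runs a loop shaped like the one described in the report for a fixed
   * number of iterations and returns how often the generic catch was reached.
   */
  static int runLoop(long recheckIntervalMs, int iterations) {
    int caught = 0;
    for (int i = 0; i < iterations; i++) {
      try {
        // Thread.sleep(long) throws IllegalArgumentException for a negative value.
        Thread.sleep(recheckIntervalMs);
      } catch (InterruptedException ie) {
        // Restore the interrupt flag and stop, as a well-behaved daemon would.
        Thread.currentThread().interrupt();
        break;
      } catch (Throwable t) {
        // The real daemon logs a warning here; this sketch only counts.
        // Nothing changes between iterations, so the next one fails the same way.
        caught++;
      }
    }
    return caught;
  }

  public static void main(String[] args) {
    System.out.println("failures with -1 ms: " + runLoop(-1L, 1000));
    System.out.println("failures with 0 ms: " + runLoop(0L, 1000));
  }
}
```

With a negative interval every iteration ends in the catch block without ever sleeping, which is why the log fills up so quickly.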
> I think there are two potential solutions.
> The first is to move 'catch(Throwable e)' in the LeaseMonitor daemon outside
> the loop body, as the NameNodeResourceMonitor daemon does, so that the thread
> ends when an unexpected exception is caught.
> The second is to use Preconditions.checkArgument() to validate the range of
> 'dfs.namenode.lease-recheck-interval-ms' when it is read, so that an invalid
> value cannot affect the subsequent operation of the program.
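
The first option, catching outside the loop, might look roughly like the sketch below. It is an illustration under assumptions, not a patch: `ExitOnUnexpectedErrorDemo` and `runLoop` are hypothetical names, the loop is bounded by `maxIterations` so the example terminates, and the periodic work is reduced to a comment.

```java
/** Hypothetical sketch of a monitor loop that ends on an unexpected error. */
final class ExitOnUnexpectedErrorDemo {

  /** Returns the number of iterations that were started before the loop ended. */
  static int runLoop(long recheckIntervalMs, int maxIterations) {
    int started = 0;
    try {
      while (started < maxIterations) {
        started++;
        // Throws IllegalArgumentException if the interval is negative.
        Thread.sleep(recheckIntervalMs);
        // The periodic lease check would run here.
      }
    } catch (InterruptedException ie) {
      // Normal shutdown path: restore the flag and leave the loop.
      Thread.currentThread().interrupt();
    } catch (Throwable t) {
      // Reported once; the loop is not re-entered, so the message cannot repeat.
      System.err.println("Monitor stopping after unexpected error: " + t);
    }
    return started;
  }

  public static void main(String[] args) {
    System.out.println("iterations with -1 ms: " + runLoop(-1L, 1000));
  }
}
```

The trade-off is that the periodic check no longer runs after the error, so this bounds the log growth but does not make a bad configuration value usable.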
>
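The second option, validating the value when it is read, could be sketched as follows. To keep the example free of dependencies it uses a plain check that throws IllegalArgumentException, which is the exception Guava's Preconditions.checkArgument() throws on a failed check. The class `RecheckIntervalValidation` and the method `checkedRecheckInterval` are hypothetical, and rejecting only negative values is an assumption; whether zero should also be rejected is a separate decision.

```java
/** Hypothetical sketch of range-checking the interval at read time. */
final class RecheckIntervalValidation {

  static final String KEY = "dfs.namenode.lease-recheck-interval-ms";

  /**
   * Returns the configured value unchanged if it is usable as a sleep
   * interval, and fails fast with a descriptive message otherwise.
   */
  static long checkedRecheckInterval(long configuredMs) {
    if (configuredMs < 0) {
      throw new IllegalArgumentException(
          "Invalid value for " + KEY + ": " + configuredMs + " (must not be negative)");
    }
    return configuredMs;
  }

  public static void main(String[] args) {
    System.out.println(checkedRecheckInterval(2000L));
    try {
      checkedRecheckInterval(-1L);
    } catch (IllegalArgumentException e) {
      // The error surfaces once, where the value is read, instead of in the loop.
      System.out.println(e.getMessage());
    }
  }
}
```

Failing at read time means the misconfiguration is reported once, with the name of the offending key, before the monitor thread ever starts sleeping on it.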