[jira] [Work logged] (HDFS-16408) Negative LeaseRecheckIntervalMs will let LeaseMonitor loop forever and print huge amount of log

ASF GitHub Bot (Jira) Tue, 04 Jan 2022 05:50:06 -0800


     [ 
https://issues.apache.org/jira/browse/HDFS-16408?focusedWorklogId=703306&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-703306
 ]


ASF GitHub Bot logged work on HDFS-16408:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Jan/22 13:49
            Start Date: 04/Jan/22 13:49
    Worklog Time Spent: 10m 
      Work Description: sodonnel commented on a change in pull request #3853:
URL: https://github.com/apache/hadoop/pull/3853#discussion_r778094499



##########
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
##########
@@ -530,35 +530,37 @@ public void setLeasePeriod(long softLimit, long 
hardLimit) {
     /** Check leases periodically. */
     @Override
     public void run() {
-      for(; shouldRunMonitor && fsnamesystem.isRunning(); ) {
-        boolean needSync = false;
-        try {
-          // sleep now to avoid infinite loop if an exception was thrown.
-          Thread.sleep(fsnamesystem.getLeaseRecheckIntervalMs());
-
-          // pre-filter the leases w/o the fsn lock.
-          Collection<Lease> candidates = getExpiredCandidateLeases();
-          if (candidates.isEmpty()) {
-            continue;
-          }
-
-          fsnamesystem.writeLockInterruptibly();
+      try {
+        for (; shouldRunMonitor && fsnamesystem.isRunning(); ) {
+          boolean needSync = false;
           try {
-            if (!fsnamesystem.isInSafeMode()) {
-              needSync = checkLeases(candidates);
+            // sleep now to avoid infinite loop if an exception was thrown.
+            Thread.sleep(fsnamesystem.getLeaseRecheckIntervalMs());
+
+            // pre-filter the leases w/o the fsn lock.
+            Collection<Lease> candidates = getExpiredCandidateLeases();
+            if (candidates.isEmpty()) {
+              continue;
             }
-          } finally {
-            fsnamesystem.writeUnlock("leaseManager");
-            // lease reassignments should to be sync'ed.
-            if (needSync) {
-              fsnamesystem.getEditLog().logSync();
+
+            fsnamesystem.writeLockInterruptibly();
+            try {
+              if (!fsnamesystem.isInSafeMode()) {
+                needSync = checkLeases(candidates);
+              }
+            } finally {
+              fsnamesystem.writeUnlock("leaseManager");
+              // lease reassignments should to be sync'ed.
+              if (needSync) {
+                fsnamesystem.getEditLog().logSync();
+              }
             }
+          } catch (InterruptedException ie) {
+            LOG.debug("{} is interrupted", name, ie);
           }
-        } catch(InterruptedException ie) {
-          LOG.debug("{} is interrupted", name, ie);
-        } catch(Throwable e) {
-          LOG.warn("Unexpected throwable: ", e);
         }
+      } catch (Throwable e) {

Review comment:
       The NamenodeResoureMonitor thread would also exit if it encounters an 
unexpected exception, and that is probably not good. However the 
ResourceMonitor is less critical than the LeaseMonitor. Ideally, there should 
be a bound check on any configuration to prevent illegal values.
   
   I do not have any examples, but it may be possible that some transient issue 
can cause some exception in the LeaseManager, and it would recover on a 
subsequent run, but if we keep the change here, then the thread will exit and 
no more runs will be attempted. In my opinion, it is better that the exception 
is caught, and that it retries again, logging the error each time. If it never 
recovers, then the namenode will need restarted after fixing the underlying 
issue.
   
   If you check `NameNodeEditLogRoller` inside FSNameSystem, you can see it 
catches its exceptions inside the run loop and will try again on failure after 
the sleep interval.
   
   I think we should revert this part of the change, but keep the new range 
check you added instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 703306)
    Remaining Estimate: 20m  (was: 0.5h)
            Time Spent: 40m  (was: 0.5h)

> Negative LeaseRecheckIntervalMs will let LeaseMonitor loop forever and print 
> huge amount of log
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16408
>                 URL: https://issues.apache.org/jira/browse/HDFS-16408
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.3, 3.3.1
>            Reporter: Jingxuan Fu
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 1h
>          Time Spent: 40m
>  Remaining Estimate: 20m
>
> There is a problem with the try catch statement in the LeaseMonitor daemon 
> (in LeaseManager.java), when an unknown exception is caught, it simply prints 
> a warning message and continues with the next loop. 
> An extreme case is when the configuration item 
> 'dfs.namenode.lease-recheck-interval-ms' is accidentally set to a negative 
> number by the user, as the configuration item is read without checking its 
> range, 'fsnamesystem. getLeaseRecheckIntervalMs()' returns this value and is 
> used as an argument to Thread.sleep(). A negative argument will cause 
> Thread.sleep() to throw an IllegalArgumentException, which will be caught by 
> 'catch(Throwable e)' and a warning message will be printed. 
> This behavior is repeated for each subsequent loop. This means that a huge 
> amount of repetitive messages will be printed to the log file in a short 
> period of time, quickly consuming disk space and affecting the operation of 
> the system.
> As you can see, 178M log files are generated in one minute.
>  
> {code:java}
> ll logs/
> total 174456
> drwxrwxr-x  2 hadoop hadoop      4096 1月   3 15:13 ./
> drwxr-xr-x 11 hadoop hadoop      4096 1月   3 15:13 ../
> -rw-rw-r--  1 hadoop hadoop     36342 1月   3 15:14 
> hadoop-hadoop-datanode-ljq1.log
> -rw-rw-r--  1 hadoop hadoop      1243 1月   3 15:13 
> hadoop-hadoop-datanode-ljq1.out
> -rw-rw-r--  1 hadoop hadoop 178545466 1月   3 15:14 
> hadoop-hadoop-namenode-ljq1.log
> -rw-rw-r--  1 hadoop hadoop       692 1月   3 15:13 
> hadoop-hadoop-namenode-ljq1.out
> -rw-rw-r--  1 hadoop hadoop     33201 1月   3 15:14 
> hadoop-hadoop-secondarynamenode-ljq1.log
> -rw-rw-r--  1 hadoop hadoop      3764 1月   3 15:14 
> hadoop-hadoop-secondarynamenode-ljq1.out
> -rw-rw-r--  1 hadoop hadoop         0 1月   3 15:13 SecurityAuth-hadoop.audit
>  
> tail -n 15 logs/hadoop-hadoop-namenode-ljq1.log 
> 2022-01-03 15:14:46,032 WARN 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
> java.lang.IllegalArgumentException: timeout value is negative
>         at java.base/java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>         at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-01-03 15:14:46,033 WARN 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
> java.lang.IllegalArgumentException: timeout value is negative
>         at java.base/java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>         at java.base/java.lang.Thread.run(Thread.java:829)
> 2022-01-03 15:14:46,033 WARN 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable: 
> java.lang.IllegalArgumentException: timeout value is negative
>         at java.base/java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:534)
>         at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>  
> I think there are two potential solutions. 
> The first is to adjust the position of the try catch statement in the 
> LeaseMonitor daemon by moving 'catch(Throwable e)' to the outside of the loop 
> body. This can be done like the NameNodeResourceMonitor daemon, which ends 
> the thread when an unexpected exception is caught. 
> The second is to use Precondition.checkArgument() to scope the configuration 
> item 'dfs.namenode.lease-recheck-interval-ms' when it is read, to avoid the 
> wrong configuration item can affect the subsequent operation of the program.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Work logged] (HDFS-16408) Negative LeaseRecheckIntervalMs will let LeaseMonitor loop forever and print huge amount of log

Reply via email to