[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710955#comment-13710955 ]

Hudson commented on HADOOP-9618:

SUCCESS: Integrated in Hadoop-Yarn-trunk #273 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/273/])
HADOOP-9618. thread which detects GC pauses (Todd Lipcon via Colin Patrick McCabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java

Add thread which detects JVM pauses
---
Key: HADOOP-9618
URL: https://issues.apache.org/jira/browse/HADOOP-9618
Project: Hadoop Common
Issue Type: New Feature
Components: util
Affects Versions: 3.0.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 2.2.0
Attachments: hadoop-9618.txt

Users often struggle to understand what happened when a long JVM pause (GC or otherwise) causes things to malfunction inside a Hadoop daemon. For example, a long GC pause while logging an edit to the QJM may cause the edit to time out, or a long GC pause may cause other IPCs to the NameNode to time out.

We should add a simple thread which loops on 1-second sleeps, and if the sleep ever takes significantly longer than 1 second, log a WARN. This will make GC pauses obvious in logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
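The sleep-loop idea in the issue description above can be sketched roughly as follows. This is a minimal illustration, not the committed JvmPauseMonitor: the class name `PauseDetectorSketch`, the helper `extraSleepMs`, and both constructor parameters are hypothetical, and the warning is written to stderr rather than through a logging framework. The message wording follows the sample log line quoted elsewhere in this thread.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a sleep-based pause detector; not the committed JvmPauseMonitor. */
final class PauseDetectorSketch implements Runnable {
    private final long sleepMs;          // assumed parameter: nominal sleep, 1 second in the proposal
    private final long warnThresholdMs;  // assumed parameter: extra delay that triggers a warning
    private volatile boolean running = true;

    PauseDetectorSketch(long sleepMs, long warnThresholdMs) {
        this.sleepMs = sleepMs;
        this.warnThresholdMs = warnThresholdMs;
    }

    /** Extra time a sleep took beyond what was requested, clamped at zero. */
    static long extraSleepMs(long startNanos, long endNanos, long sleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - sleepMs);
    }

    @Override
    public void run() {
        while (running) {
            // nanoTime is monotonic, so wall-clock adjustments do not look like pauses.
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraSleepMs(start, System.nanoTime(), sleepMs);
            if (extra > warnThresholdMs) {
                System.err.println("WARN Detected pause in JVM or host machine (eg GC): "
                        + "pause of approximately " + extra + "ms");
            }
        }
    }

    /** Asks the loop to exit after the current sleep finishes. */
    void stop() {
        running = false;
    }
}
```

A daemon would start this on a daemon thread, for example `new Thread(new PauseDetectorSketch(1000, 2000))` with `setDaemon(true)`, so the detector never keeps the process alive on its own.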
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711062#comment-13711062 ]

Hudson commented on HADOOP-9618:

FAILURE: Integrated in Hadoop-Hdfs-trunk #1463 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1463/])
HADOOP-9618. thread which detects GC pauses (Todd Lipcon via Colin Patrick McCabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711109#comment-13711109 ]

Hudson commented on HADOOP-9618:

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1490 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1490/])
HADOOP-9618. thread which detects GC pauses (Todd Lipcon via Colin Patrick McCabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709997#comment-13709997 ]

Hudson commented on HADOOP-9618:

SUCCESS: Integrated in Hadoop-trunk-Commit #4091 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4091/])
HADOOP-9618. thread which detects GC pauses (Todd Lipcon via Colin Patrick McCabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699112#comment-13699112 ]

Colin Patrick McCabe commented on HADOOP-9618:

This kind of info will be really useful in debugging. +1.

If there are no more comments, I'll commit in a day or two.
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699231#comment-13699231 ]

Aaron T. Myers commented on HADOOP-9618:

Patch looks pretty good to me as well. My only suggestion would be to also consider putting one of these in the JN, ZKFC, and SecondaryNameNode daemons.
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677214#comment-13677214 ]

Todd Lipcon commented on HADOOP-9618:

Hey Hitesh. We already have the EventCounter log4j appender. Rather than making one-off metrics for this, I think we should just extend the EventCounter to take a list of regular expressions that map log messages to metrics, e.g. you could say something like the following in the log4j configuration:

{code}
log4j.appender.EventCounter.WARN.gc-pauses=Detected pause in JVM
{code}

Does that seem like a more general way of achieving the above?

bq. FWIW, -XX:+UseGCLogFileRotation is available in JDK 6u34+ and 7u2+.

Thanks, I forgot about that new feature. Still, it's nicer to have this info exposed via log4j, and with a consistent format (the Java GC logs keep changing format and also look different depending on which collector you're using, if I recall correctly).
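The regex-to-metric idea in the comment above could look something like the sketch below. It is deliberately independent of log4j: the class `RegexEventCounterSketch` and its methods are hypothetical, and a real extension of EventCounter would receive messages through the appender callback instead of `onLogMessage`. It assumes all metrics are registered before any messages arrive.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;

/** Hypothetical sketch: count log messages that match named regular expressions. */
final class RegexEventCounterSketch {
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();
    private final Map<String, AtomicLong> counts = new LinkedHashMap<>();

    /** Registers a metric; assumed to be called during configuration, before logging starts. */
    void addMetric(String name, String regex) {
        patterns.put(name, Pattern.compile(regex));
        counts.put(name, new AtomicLong());
    }

    /** Increments every metric whose pattern occurs anywhere in the message. */
    void onLogMessage(String message) {
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(message).find()) {
                counts.get(e.getKey()).incrementAndGet();
            }
        }
    }

    /** Current value of a metric, or 0 if the name was never registered. */
    long count(String name) {
        AtomicLong c = counts.get(name);
        return c == null ? 0L : c.get();
    }
}
```

Registering `addMetric("gc-pauses", "Detected pause in JVM")` mirrors the configuration line proposed above; any other pattern-to-name pair works the same way.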
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676110#comment-13676110 ]

Jonathan Ellis commented on HADOOP-9618:

bq. The problem is that the GC logs don't roll

FWIW, -XX:+UseGCLogFileRotation is available in JDK 6u34+ and 7u2+.
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676227#comment-13676227 ]

Colin Patrick McCabe commented on HADOOP-9618:

Those are good points. Any pause, whether it is a kernel pause, GC pause, safepoint pause, etc., should be trapped. If the interval turns out to be too short, we can always tweak it later.

+1.
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676236#comment-13676236 ]

Hitesh Shah commented on HADOOP-9618:

[~tlipcon] does it make sense to expose this information through a metric too (i.e., increment counters when warn/info levels are hit)?
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674684#comment-13674684 ]

Todd Lipcon commented on HADOOP-9618:

BTW, the info reported from the beans seems to be off due to an OpenJDK bug. When I run the same test program with Oracle JDK 1.6.0_14 I get correct stats from the CMS MXBean:

13/06/04 11:36:33 INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3232ms
GC pool 'ParNew' had collection(s): count=1 time=56ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=3665ms
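Per-collector lines like the ones in the log excerpt above can be derived by snapshotting the JVM's GarbageCollectorMXBeans before and after a sleep and reporting the difference. The following is a rough sketch under that assumption, not the patch's actual code; `GcDeltaSketch`, `snapshot`, and `describeDelta` are hypothetical names. The output format imitates the log lines quoted above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: report GC activity between two points in time. */
final class GcDeltaSketch {
    /** Cumulative {collection count, collection time in ms} keyed by collector name. */
    static Map<String, long[]> snapshot() {
        Map<String, long[]> snap = new TreeMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector;
            // a -1 in both snapshots yields a delta of 0 and is skipped below.
            snap.put(gc.getName(), new long[] { gc.getCollectionCount(), gc.getCollectionTime() });
        }
        return snap;
    }

    /** One line per collector that ran at least once between the two snapshots. */
    static String describeDelta(Map<String, long[]> before, Map<String, long[]> after) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, long[]> e : after.entrySet()) {
            long[] b = before.get(e.getKey());
            long[] a = e.getValue();
            long countDelta = a[0] - (b == null ? 0L : b[0]);
            long timeDelta = a[1] - (b == null ? 0L : b[1]);
            if (countDelta > 0) {
                sb.append("GC pool '").append(e.getKey())
                  .append("' had collection(s): count=").append(countDelta)
                  .append(" time=").append(timeDelta).append("ms\n");
            }
        }
        return sb.toString();
    }
}
```

A pause detector would call `snapshot()` before each sleep and again after it, and append `describeDelta(before, after)` to the warning only when the sleep overran.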
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675235#comment-13675235 ]

Hadoop QA commented on HADOOP-9618:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12586143/hadoop-9618.txt
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/2599//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/2599//console

This message is automatically generated.
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675436#comment-13675436 ]

Colin Patrick McCabe commented on HADOOP-9618:

I kind of wish we could use the JVM's {{-Xloggc:logfile}} to get this information, since theoretically it should be more trustworthy than trying to guess. Is that too much hassle to configure by default?

I suppose the thread method detects machine pauses which are *not* the result of GCs, so you could say that it gives more information (although perhaps more questionable information).

I'm a little gun-shy of the 1 second timeout. It wasn't too long ago that the Linux scheduler quantum was 100 milliseconds. So if you had ten threads hogging the CPU, you'd already have no time left to run your watchdog thread. I think the timeout either needs to be longer, or the thread needs to be a high-priority thread, possibly even realtime priority.

Have you tried running this with a gnarly MapReduce job going on?
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675439#comment-13675439 ]

Colin Patrick McCabe commented on HADOOP-9618:

er, that should read 10 milliseconds / 100 CPU-bound threads
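The arithmetic behind the corrected figures can be written out as a one-line calculation. This is a back-of-the-envelope sketch only: it assumes a single core and strict round-robin among equal-priority threads, which is a simplification and not a model of the real Linux scheduler. The class and method names are hypothetical.

```java
/** Hypothetical back-of-the-envelope scheduling arithmetic; not a scheduler model. */
final class SchedulerStarvationSketch {
    /**
     * Worst-case wait before a runnable thread gets the CPU again, assuming one core
     * and strict round-robin, where every competing thread uses its full quantum.
     */
    static long worstCaseWaitMs(long quantumMs, int cpuBoundThreads) {
        return quantumMs * cpuBoundThreads;
    }
}
```

With a 10 ms quantum and 100 CPU-bound threads, `worstCaseWaitMs(10, 100)` is 1000 ms, the same length as the proposed 1-second sleep, which is the source of the concern.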
[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
[ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675471#comment-13675471 ]

Todd Lipcon commented on HADOOP-9618:

bq. I kind of wish we could use the JVM's -Xloggc:logfile to get this information, since theoretically it should be more trustworthy than trying to guess. Is that too much hassle to configure by default?

The problem is that the GC logs don't roll, plus it's difficult to correlate them with the log4j stream, since the timestamps in the GC logs are in a different format than log4j's, etc. -- plus they won't roll up through alternate log4j appenders to centralized monitoring.

bq. I suppose the thread method detects machine pauses which are not the result of GCs, so you could say that it gives more information (although perhaps more questionable information).

Yep - I've seen cases where the kernel locks up for multiple seconds due to some bug, and that's interesting. Also there are JVM safepoint pauses, which are nasty and aren't in the GC logs unless you use -XX:+PrintSafepointStatistics, which is super verbose.

bq. I'm a little gun-shy of the 1 second timeout. It wasn't too long ago that the Linux scheduler quantum was 100 milliseconds. So if you had ten threads hogging the CPU, you'd already have no time left to run your watchdog thread. I think the timeout either needs to be longer, or the thread needs to be a high-priority thread, possibly even realtime priority.

If one of your important Hadoop daemons is so overloaded, I think that would be interesting as well. This only logs if the 1-second sleep takes 3 seconds, so things like scheduling jitter won't cause log messages unless the jitter is multiple seconds long. At that point, I'd want to know about it regardless of whether it's GC, a kernel issue, contention for machine resources, swap, etc. Do you disagree?