[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710955#comment-13710955
 ] 

Hudson commented on HADOOP-9618:


SUCCESS: Integrated in Hadoop-Yarn-trunk #273 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/273/])
HADOOP-9618.  thread which detects GC pauses (Todd Lipcon via Colin Patrick 
McCabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java


 Add thread which detects JVM pauses
 ---

 Key: HADOOP-9618
 URL: https://issues.apache.org/jira/browse/HADOOP-9618
 Project: Hadoop Common
  Issue Type: New Feature
  Components: util
Affects Versions: 3.0.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 2.2.0

 Attachments: hadoop-9618.txt


 Oftentimes users struggle to understand what happened when a long JVM pause 
 (GC or otherwise) causes things to malfunction inside a Hadoop daemon. For 
 example, a long GC pause while logging an edit to the QJM may cause the edit 
 to time out, or a long GC pause may cause other IPCs to the NameNode to time 
 out. We should add a simple thread which loops on 1-second sleeps, and if the 
 sleep ever takes significantly longer than 1 second, log a WARN. This will 
 make GC pauses obvious in logs.
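
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the code in the attached patch: the class name `PauseDetectorSketch`, the 2000 ms warning threshold, and the use of `System.err` in place of a real logger are all assumptions made for the example.

```java
/** Hypothetical sketch of a sleep-based pause detector (not the attached patch). */
final class PauseDetectorSketch implements Runnable {
    // Assumed values, chosen only for illustration.
    static final long SLEEP_INTERVAL_MS = 1000;
    static final long WARN_THRESHOLD_MS = 2000;

    private volatile boolean running = true;

    /** Time slept beyond what was requested, in milliseconds, clamped at zero. */
    static long extraSleepMs(long startNanos, long endNanos, long requestedMs) {
        long elapsedMs = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMs - requestedMs);
    }

    /** True when the overshoot is large enough to be worth a WARN. */
    static boolean shouldWarn(long extraMs) {
        return extraMs >= WARN_THRESHOLD_MS;
    }

    @Override
    public void run() {
        while (running) {
            // nanoTime is monotonic, so wall-clock adjustments do not look like pauses.
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraSleepMs(start, System.nanoTime(), SLEEP_INTERVAL_MS);
            if (shouldWarn(extra)) {
                // A real daemon would log at WARN through its logging framework.
                System.err.println("WARN Detected pause in JVM or host machine (eg GC): "
                    + "pause of approximately " + extra + "ms");
            }
        }
    }

    /** Asks the loop to exit after the current sleep. */
    void stop() {
        running = false;
    }
}
```

A daemon would typically run this on a daemon thread, for example `Thread t = new Thread(new PauseDetectorSketch(), "pause-detector"); t.setDaemon(true); t.start();`, so that the monitor never keeps the process alive on its own.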

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711062#comment-13711062
 ] 

Hudson commented on HADOOP-9618:


FAILURE: Integrated in Hadoop-Hdfs-trunk #1463 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1463/])
HADOOP-9618.  thread which detects GC pauses (Todd Lipcon via Colin Patrick 
McCabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java




[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711109#comment-13711109
 ] 

Hudson commented on HADOOP-9618:


SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1490 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1490/])
HADOOP-9618.  thread which detects GC pauses (Todd Lipcon via Colin Patrick 
McCabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java




[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709997#comment-13709997
 ] 

Hudson commented on HADOOP-9618:


SUCCESS: Integrated in Hadoop-trunk-Commit #4091 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4091/])
HADOOP-9618.  thread which detects GC pauses (Todd Lipcon via Colin Patrick 
McCabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503806)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java




[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699112#comment-13699112
 ] 

Colin Patrick McCabe commented on HADOOP-9618:
--

This kind of info will be really useful in debugging.

+1.

If there are no more comments, I'll commit in a day or two.



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-07-03 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699231#comment-13699231
 ] 

Aaron T. Myers commented on HADOOP-9618:


Patch looks pretty good to me as well. My only suggestion would be to consider 
putting one of these in the JN, ZKFC, and SecondaryNameNode daemons too.



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-06 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677214#comment-13677214
 ] 

Todd Lipcon commented on HADOOP-9618:
-

Hey Hitesh. We already have the EventCounter log4j appender. Rather than 
making one-off metrics for this, I think we should just extend the EventCounter 
to take a list of regular expressions that map log messages to metrics. For 
example, you could say something like the following in the log4j configuration:

{code}
log4j.appender.EventCounter.WARN.gc-pauses=Detected pause in JVM
{code}

Does that seem like a more general way of achieving the above?
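
As a rough sketch of the regex-to-metric idea proposed above: the class below is hypothetical and is not the existing EventCounter API. The name `RegexEventCounterSketch` and its methods are assumptions made for illustration, and wiring it into a log4j appender is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Pattern;

/** Hypothetical sketch: counts log messages that match registered patterns. */
final class RegexEventCounterSketch {
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();
    private final Map<String, AtomicLong> counts = new LinkedHashMap<>();

    /** Registers a metric name and the regular expression that feeds it. */
    void register(String metricName, String regex) {
        patterns.put(metricName, Pattern.compile(regex));
        counts.put(metricName, new AtomicLong());
    }

    /** Increments every metric whose pattern occurs anywhere in the message. */
    void observe(String logMessage) {
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(logMessage).find()) {
                counts.get(e.getKey()).incrementAndGet();
            }
        }
    }

    /** Returns the current count, or 0 for an unregistered metric. */
    long count(String metricName) {
        AtomicLong c = counts.get(metricName);
        return c == null ? 0L : c.get();
    }
}
```

In this sketch all `register` calls are expected to happen before logging starts, since the maps themselves are not synchronized; only the counters are safe to update concurrently.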

bq. FWIW, -XX:+UseGCLogFileRotation is available in JDK 6u34+ and 7u2+.
Thanks, I forgot about that new feature. Still, it's nicer to have this info 
exposed via log4j, and with a consistent format (the Java GC logs keep changing 
format and also look different depending on which collector you're using, if I 
recall correctly).



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-05 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676110#comment-13676110
 ] 

Jonathan Ellis commented on HADOOP-9618:


bq. The problem is that the GC logs don't roll

FWIW, -XX:+UseGCLogFileRotation is available in JDK 6u34+ and 7u2+.



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-05 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676227#comment-13676227
 ] 

Colin Patrick McCabe commented on HADOOP-9618:
--

Those are good points.  Any pause, whether it is a kernel pause, GC pause, 
safepoint pause, etc. should be trapped.  If the interval turns out to be too 
short, we can always tweak it later.  +1.



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-05 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676236#comment-13676236
 ] 

Hitesh Shah commented on HADOOP-9618:
-

[~tlipcon] does it make sense to expose this information through a metric too 
(i.e. increment counters when warn/info levels are hit)?



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674684#comment-13674684
 ] 

Todd Lipcon commented on HADOOP-9618:
-

BTW, the info reported from the beans seems to be off due to an OpenJDK bug. 
When I run the same test program with Oracle JDK 1.6.0_14 I get correct stats 
from the CMS MXBean:

13/06/04 11:36:33 INFO util.JvmPauseMonitor: Detected pause in JVM or host 
machine (eg GC): pause of approximately 3232ms
GC pool 'ParNew' had collection(s): count=1 time=56ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=3665ms
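
Per-collector lines like the ones above can be produced by diffing two snapshots of the JVM's `GarbageCollectorMXBean` counters taken before and after the sleep. The sketch below shows one way to do that; the class `GcDeltaSketch` and its method names are assumptions for illustration and are not taken from the patch.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: reports GC activity between two MXBean snapshots. */
final class GcDeltaSketch {
    /** Snapshot of every collector: name -> {collection count, collection time in ms}. */
    static Map<String, long[]> snapshot() {
        Map<String, long[]> out = new LinkedHashMap<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.put(bean.getName(),
                new long[] {bean.getCollectionCount(), bean.getCollectionTime()});
        }
        return out;
    }

    /** One line per collector that ran at least once between the two snapshots. */
    static String describe(Map<String, long[]> before, Map<String, long[]> after) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, long[]> e : after.entrySet()) {
            long[] prev = before.get(e.getKey());
            if (prev == null) {
                continue; // collector was not present in the earlier snapshot
            }
            long count = e.getValue()[0] - prev[0];
            long timeMs = e.getValue()[1] - prev[1];
            if (count > 0) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append("GC pool '").append(e.getKey())
                  .append("' had collection(s): count=").append(count)
                  .append(" time=").append(timeMs).append("ms");
            }
        }
        return sb.toString();
    }
}
```

A monitor loop would call `snapshot()` before each sleep and again when a pause is detected, then append `describe(before, after)` to the warning. An empty result suggests the pause was not caused by a collection the beans know about.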




[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675235#comment-13675235
 ] 

Hadoop QA commented on HADOOP-9618:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12586143/hadoop-9618.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/2599//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/2599//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675436#comment-13675436
 ] 

Colin Patrick McCabe commented on HADOOP-9618:
--

I kind of wish we could use the JVM's {{-Xloggc:logfile}} to get this 
information, since theoretically it should be more trustworthy than trying to 
guess.  Is that too much hassle to configure by default?

I suppose the thread method detects machine pauses which are *not* the result 
of GCs, so you could say that it gives more information (although perhaps more 
questionable information).

I'm a little gun-shy of the 1 second timeout.  It wasn't too long ago that the 
Linux scheduler quantum was 100 milliseconds.  So if you had ten threads 
hogging the CPU, you'd already have no time left to run your watchdog thread.  
I think the timeout either needs to be longer, or the thread needs to be a 
high-priority thread, possibly even realtime priority.

Have you tried running this with a gnarly MapReduce job going on?



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675439#comment-13675439
 ] 

Colin Patrick McCabe commented on HADOOP-9618:
--

er, that should read 10 milliseconds / 100 CPU-bound threads



[jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

2013-06-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675471#comment-13675471
 ] 

Todd Lipcon commented on HADOOP-9618:
-

bq. I kind of wish we could use the JVM's -Xloggc:logfile to get this 
information, since theoretically it should be more trustworthy than trying to 
guess. Is that too much hassle to configure by default?

The problem is that the GC logs don't roll, and it's difficult to correlate 
them with the log4j stream, since the timestamps in the GC logs are in a 
different format than log4j's. They also won't roll up through alternate log4j 
appenders to centralized monitoring.

bq. I suppose the thread method detects machine pauses which are not the result 
of GCs, so you could say that it gives more information (although perhaps more 
questionable information).

Yep - I've seen cases where the kernel locks up for multiple seconds due to 
some bug, and that's interesting. There are also JVM safepoint pauses, which 
are nasty and aren't in the GC logs unless you use 
-XX:+PrintSafepointStatistics, which is super verbose.

bq. I'm a little gun-shy of the 1 second timeout. It wasn't too long ago that 
the Linux scheduler quantum was 100 milliseconds. So if you had ten threads 
hogging the CPU, you'd already have no time left to run your watchdog thread. I 
think the timeout either needs to be longer, or the thread needs to be a 
high-priority thread, possibly even realtime priority.

If one of your important Hadoop daemons is that overloaded, I think that would 
be interesting as well. This only logs if the 1-second sleep takes 3 seconds, so 
things like scheduling jitter won't cause log messages unless the jitter is 
multiple seconds long. At that point, I'd want to know about it regardless of 
whether it's GC, a kernel issue, contention for machine resources, swap, etc. 
Do you disagree?
