[ 
https://issues.apache.org/jira/browse/HADOOP-16850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated HADOOP-16850:
------------------------------
    Description: 
Recently we found that a JMX request took 5 seconds or more to complete when 
there were 10,000+ threads in a stressed DataNode process. Meanwhile, other 
HTTP requests were blocked and some disk operations were affected (we saw many 
"Slow manageWriterOsCache" messages in the DN log, and these messages were 
rarely seen again after we stopped sending JMX requests).

The excessive time is spent getting thread info via ThreadMXBean, which calls 
the native method ThreadImpl#getThreadInfo. According to 
[JDK-8185005|https://bugs.openjdk.java.net/browse/JDK-8185005], the time 
complexity of ThreadImpl#getThreadInfo is O(n*n), and it holds the global 
thread lock, which prevents the creation or termination of threads.

To improve this, I propose getting thread info from the thread group by 
default, which is much faster, while still supporting the original approach 
when "-Dhadoop.metrics.jvm.use-thread-mxbean=true" is set in the startup 
command.
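
The two approaches could be sketched roughly as follows. This is only an 
illustration under my own assumptions, not the code from the patch: the class 
ThreadStateCounter and its method names are hypothetical, and only the system 
property name is taken from this issue. It counts threads per state, either by 
enumerating the root thread group or through ThreadMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not the class from the actual patch.
final class ThreadStateCounter {

  // Reads the switch named in this issue; false when the property is unset.
  static boolean useThreadMXBean() {
    return Boolean.getBoolean("hadoop.metrics.jvm.use-thread-mxbean");
  }

  // Counts thread states by enumerating the root thread group.
  static Map<Thread.State, Integer> countViaThreadGroup() {
    ThreadGroup root = Thread.currentThread().getThreadGroup();
    while (root.getParent() != null) {
      root = root.getParent();
    }
    // activeCount() is only an estimate, so grow the array until enumerate()
    // leaves spare room, which means no thread was silently dropped.
    Thread[] threads = new Thread[root.activeCount() + 16];
    int n = root.enumerate(threads, true);
    while (n == threads.length) {
      threads = new Thread[threads.length * 2];
      n = root.enumerate(threads, true);
    }
    Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
    for (int i = 0; i < n; i++) {
      counts.merge(threads[i].getState(), 1, Integer::sum);
    }
    return counts;
  }

  // Counts thread states via ThreadMXBean (the original, slower approach).
  static Map<Thread.State, Integer> countViaThreadMXBean() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // maxDepth = 0: no stack traces are requested.
    ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), 0);
    Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
    for (ThreadInfo info : infos) {
      if (info != null) { // null if the thread has already terminated
        counts.merge(info.getThreadState(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<Thread.State, Integer> counts =
        useThreadMXBean() ? countViaThreadMXBean() : countViaThreadGroup();
    System.out.println(counts);
  }
}
```

Note that the thread group only exposes what Thread itself offers (name, 
state and so on), so metrics that need ThreadInfo details would still require 
ThreadMXBean.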

Example performance test results for the two approaches (ratio = ThreadMXBean 
time / ThreadGroup time, truncated to an integer):
{noformat}
#Threads=100, ThreadMXBean=382372 ns, ThreadGroup=72046 ns, ratio: 5
#Threads=200, ThreadMXBean=776619 ns, ThreadGroup=83875 ns, ratio: 9
#Threads=500, ThreadMXBean=3392954 ns, ThreadGroup=216269 ns, ratio: 15
#Threads=1000, ThreadMXBean=9475768 ns, ThreadGroup=220447 ns, ratio: 42
#Threads=2000, ThreadMXBean=53833729 ns, ThreadGroup=579608 ns, ratio: 92
#Threads=3000, ThreadMXBean=196829971 ns, ThreadGroup=1157670 ns, ratio: 170
{noformat}

  was:
Recently we found that a JMX request took 5 seconds or more to complete when 
there were 10,000+ threads in a stressed DataNode process. Meanwhile, other 
HTTP requests were blocked and some disk operations were affected (we saw many 
"Slow manageWriterOsCache" messages in the DN log, and these messages were 
rarely seen again after we stopped sending JMX requests).

The excessive time is spent getting thread info via ThreadMXBean, which calls 
the native method ThreadImpl#getThreadInfo. According to JDK-8185005, the time 
complexity of ThreadImpl#getThreadInfo is O(n*n), and it may hold the global 
thread lock (preventing the creation or termination of threads) for a long time.

To improve this, I propose getting thread info from the thread group by 
default, which is much faster, while still supporting the original approach 
when "-Dhadoop.metrics.jvm.use-thread-mxbean=true" is set in the startup 
command.

Example performance test results for the two approaches (ratio = ThreadMXBean 
time / ThreadGroup time, truncated to an integer):
{noformat}
#Threads=100, ThreadMXBean=382372 ns, ThreadGroup=72046 ns, ratio: 5
#Threads=200, ThreadMXBean=776619 ns, ThreadGroup=83875 ns, ratio: 9
#Threads=500, ThreadMXBean=3392954 ns, ThreadGroup=216269 ns, ratio: 15
#Threads=1000, ThreadMXBean=9475768 ns, ThreadGroup=220447 ns, ratio: 42
#Threads=2000, ThreadMXBean=53833729 ns, ThreadGroup=579608 ns, ratio: 92
#Threads=3000, ThreadMXBean=196829971 ns, ThreadGroup=1157670 ns, ratio: 170
{noformat}


> Support getting thread info from thread group for JvmMetrics to improve the 
> performance
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16850
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16850
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 2.8.6, 2.9.3, 3.1.4, 3.2.2, 2.10.1, 3.3.1
>            Reporter: Tao Yang
>            Priority: Major
>
> Recently we found that a JMX request took 5 seconds or more to complete when 
> there were 10,000+ threads in a stressed DataNode process. Meanwhile, other 
> HTTP requests were blocked and some disk operations were affected (we saw 
> many "Slow manageWriterOsCache" messages in the DN log, and these messages 
> were rarely seen again after we stopped sending JMX requests).
> The excessive time is spent getting thread info via ThreadMXBean, which calls 
> the native method ThreadImpl#getThreadInfo. According to 
> [JDK-8185005|https://bugs.openjdk.java.net/browse/JDK-8185005], the time 
> complexity of ThreadImpl#getThreadInfo is O(n*n), and it holds the global 
> thread lock, which prevents the creation or termination of threads.
> To improve this, I propose getting thread info from the thread group by 
> default, which is much faster, while still supporting the original approach 
> when "-Dhadoop.metrics.jvm.use-thread-mxbean=true" is set in the startup 
> command.
> Example performance test results for the two approaches (ratio = ThreadMXBean 
> time / ThreadGroup time, truncated to an integer):
> {noformat}
> #Threads=100, ThreadMXBean=382372 ns, ThreadGroup=72046 ns, ratio: 5
> #Threads=200, ThreadMXBean=776619 ns, ThreadGroup=83875 ns, ratio: 9
> #Threads=500, ThreadMXBean=3392954 ns, ThreadGroup=216269 ns, ratio: 15
> #Threads=1000, ThreadMXBean=9475768 ns, ThreadGroup=220447 ns, ratio: 42
> #Threads=2000, ThreadMXBean=53833729 ns, ThreadGroup=579608 ns, ratio: 92
> #Threads=3000, ThreadMXBean=196829971 ns, ThreadGroup=1157670 ns, ratio: 170
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

