[ 
https://issues.apache.org/jira/browse/HADOOP-15696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597076#comment-16597076
 ] 

Xiao Chen commented on HADOOP-15696:
------------------------------------

Thanks for the work here [~jojochuang] and [[email protected]], good find and 
discussions!

Everything you said make sense. I'm not entirely sure, but I think originally 
with tomcat, this way of connection setup didn't cause any issue probably 
because the connection was polled by tomcat. This has proven to be problematic 
in Jetty, and glad having the idle timeout dropped helped this situation.

How about we do this jira first to make it configurable, and continue the 
connection reuse improvement as a follow-on? A tricky part for the connection 
reuse is (at least on current trunk) the connection object is created in 
relation to the caller ugi, so one would need to make sure the reuse doesn't 
expose any security concerns. From use case perspective, I agree the connection 
should be reused - and if not, the connection to be proactively closed by the 
client.

Minor code review comment on patch 1:
- Instead of calling it {{HTTP_IDLE_TIMEOUT}} we may call it 
{{HTTP_IDLE_TIMEOUT_MS}} for clarity about the millisecond unit
- It seems httpfs-default could also use such an update.

> KMS performance regression due to too many open file descriptors after Jetty 
> migration
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15696
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15696
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: kms
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Blocker
>         Attachments: HADOOP-15696.001.patch, Screen Shot 2018-08-22 at 
> 11.36.16 AM.png, Screen Shot 2018-08-22 at 4.26.51 PM.png, Screen Shot 
> 2018-08-22 at 4.26.51 PM.png, Screen Shot 2018-08-22 at 4.27.02 PM.png, 
> Screen Shot 2018-08-22 at 4.30.32 PM.png, Screen Shot 2018-08-22 at 4.30.39 
> PM.png, Screen Shot 2018-08-24 at 7.08.16 PM.png
>
>
> We recently found KMS performance regressed in Hadoop 3.0, possibly linking 
> to the migration from Tomcat to Jetty in HADOOP-13597.
> Symptoms:
> # Hadoop 3.x KMS open file descriptors quickly rises to more than 10 thousand 
> under stress, sometimes even exceeds 32K, which is the system limit, causing 
> failures for any access to encryption zones. Our internal testing shows the 
> openfd number was in the range of a few hundred in Hadoop 2.x, and it 
> increases by almost 100x in Hadoop 3.
> # Hadoop 3.x KMS as much as twice the heap size than in Hadoop 2.x. The same 
> heap size can go OOM in Hadoop 3.x. Jxray analysis suggests most of them are 
> temporary byte arrays associated with open SSL connections.
> # Due to the heap usage, Hadoop 3.x KMS has more frequent GC activities, and 
> we observed up to 20% performance reduction due to GC.
> A possible solution is to reduce the idle timeout setting in HttpServer2. It 
> is currently hard-coded 10 seconds. By setting it to 1 second, open fds 
> dropped from 20 thousand down to 3 thousand in my experiment.
> File this jira to invite open discussion for a solution.
> Credit: [[email protected]] for the proposed Jetty idle timeout remedy; 
> [~xiaochen] for digging into this problem.
> Screenshots:
> CDH5 (Hadoop 2) KMS CPU utilization, resident memory and file descriptor 
> chart.
>  !Screen Shot 2018-08-22 at 4.30.39 PM.png! 
> CDH6 (Hadoop 3) KMS CPU utilization, resident memory and file descriptor 
> chart.
>  !Screen Shot 2018-08-22 at 4.30.32 PM.png! 
> CDH5 (Hadoop 2) GC activities on the KMS process
>  !Screen Shot 2018-08-22 at 4.26.51 PM.png! 
> CDH6 (Hadoop 3) GC activities on the KMS process
>  !Screen Shot 2018-08-22 at 4.27.02 PM.png! 
> JXray report
>  !Screen Shot 2018-08-22 at 11.36.16 AM.png! 
> open fd drops from 20 k down to 3k after the proposed change.
>  !Screen Shot 2018-08-24 at 7.08.16 PM.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to