[
https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259429#comment-15259429
]
Karthik Kambatla commented on MAPREDUCE-6684:
---------------------------------------------
[~jlowe] - thanks for the suggestion. It looks like that should do the trick,
and it is definitely a lot simpler and less risky than the options proposed
earlier.
[~revans2] - thanks for pointing out the load on the NN; we hadn't considered
that either.
On an unrelated note, the code that moves the files is pretty complex. Do you
think there is any value in cleaning it up? It seems to work fine as is, and
if we are going to make any changes, we would probably just want to move to
ATS v2 once that is ready.
> High contention on scanning of user directory under immediate_done in Job
> History Server
> ----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.7.0
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Attachments: jhs-jstacks-service-monitor-running.tar.gz,
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each
> user directory it scans (moving or deleting files under the user directory
> as necessary). This method is called by a thread in JobHistory that
> periodically scans the intermediate directory, and it can also be called by
> web server threads for each Web API call made by a JHS client. When there are
> many concurrent Web API calls/connections to JHS, all but one thread block on
> the lock on the user directory. Eventually the client connections time out,
> but the blocked threads in JHS are not killed, which leaves a lot of TCP
> connections in the CLOSE_WAIT state.
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp 0 0 10.17.202.19:10020 0.0.0.0:* LISTEN 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.198.30:33010 ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.200.30:33980 ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.10:59625 ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.13:35765 ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10033 0.0.0.0:* LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 0.0.0.0:* LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:35103 ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:43670 ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:45453 ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:49184 ESTABLISHED 63729/java
> tcp 1 0 10.17.202.19:19888 10.17.202.13:49992 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52703 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52707 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52708 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52710 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52714 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52723 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52726 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52727 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52739 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52749 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52753 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52757 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52760 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52820 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52827 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52829 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52831 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52833 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52836 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52839 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52841 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52843 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52850 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52860 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52876 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52879 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52881 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52884 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52886 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52888 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52891 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52893 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52896 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52898 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52899 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52902 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52909 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52910 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52912 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52923 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52925 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52927 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52930 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52937 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52939 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52945 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52947 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52969 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52972 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52975 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53004 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53007 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53009 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53011 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53052 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53058 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53059 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53063 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53071 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53084 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53093 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53095 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53097 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53101 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53104 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53106 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53108 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53110 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53112 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53114 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53115 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53117 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53121 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53123 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53125 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53127 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53129 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53131 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53134 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53138 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53140 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53153 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53155 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53157 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53159 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53173 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53176 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53177 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53178 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53179 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53181 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53183 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53201 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53204 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53218 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53267 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53270 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53275 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53278 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53280 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53283 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53293 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53296 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53299 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53309 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53312 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53314 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53317 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53320 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53322 CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53338 CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53340 CLOSE_WAIT 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53364 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53366 ESTABLISHED 63729/java
> tcp 260 0 10.17.202.19:19888 10.17.202.13:53367 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53380 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53382 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53386 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53390 ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53392 ESTABLISHED 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45301 CLOSE_WAIT 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45303 CLOSE_WAIT 63729/java
> tcp 1277 0 10.17.202.19:19888 10.17.202.18:45306 ESTABLISHED 63729/java
> {noformat}
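
To make the contention described above concrete, here is a minimal
standalone sketch in plain Java. It is not the JHS code and not necessarily
the fix adopted for this issue: the class UserDirScanSketch and the method
scanIfIdle are hypothetical names, and the "scan" is just a sleep. It shows
one possible way to keep callers from piling up behind a per-directory lock,
assuming it is acceptable for a caller to skip the scan when another thread
is already running it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for one user directory; not a Hadoop class.
class UserDirScanSketch {
    private final ReentrantLock scanLock = new ReentrantLock();
    final AtomicInteger scans = new AtomicInteger();
    final AtomicInteger skipped = new AtomicInteger();

    /**
     * Runs the scan only if no other thread is scanning right now.
     * Returns true if this caller ran the scan, false if it skipped.
     */
    boolean scanIfIdle(Runnable scan) {
        // tryLock() returns immediately instead of queuing behind the holder,
        // which is what keeps request threads from blocking indefinitely.
        if (!scanLock.tryLock()) {
            skipped.incrementAndGet();
            return false;
        }
        try {
            scan.run();
            scans.incrementAndGet();
            return true;
        } finally {
            scanLock.unlock();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        UserDirScanSketch dir = new UserDirScanSketch();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Eight "web server threads" hit the same user directory at once.
        for (int i = 0; i < 8; i++) {
            pool.execute(() -> dir.scanIfIdle(() -> sleepQuietly(50)));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        // Counts vary from run to run depending on thread scheduling.
        System.out.println("scans=" + dir.scans.get()
            + " skipped=" + dir.skipped.get());
    }
}
```

The trade-off in this sketch is that a caller that skips may see results
that are slightly stale until the in-flight scan finishes. Whether that is
acceptable for JHS clients is an assumption here, not something established
in this issue.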
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)