[ https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251116#comment-15251116 ]
Karthik Kambatla commented on MAPREDUCE-6684:
---------------------------------------------
Is it okay to scan the done directory first as Haibo suggested?
In addition to that, I wonder if there should be a single thread (with sleeps)
moving files from intermediate to done. Other threads needing a particular
job's jhist file could wake this moving thread and register for updates for
that file. Once a file is moved, the moving thread could notify these other
waiting threads. That way, we ensure serving threads wait only as long as
required.
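
To make the idea concrete, here is a rough sketch of the single-mover pattern.
It is not taken from HistoryFileManager: the class name, the method names and
the in-memory sets standing in for the intermediate and done directories are
all made up for illustration, and a real version would move files on the
FileSystem instead. Serving threads check done first, register for a job id,
wake the mover, and block until the mover notifies them or a timeout expires.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: one mover thread, many waiting serving threads. */
class SingleMoverSketch {
    // In-memory stand-ins (assumption) for the intermediate and done directories.
    private final Set<String> intermediate = ConcurrentHashMap.newKeySet();
    private final Set<String> done = ConcurrentHashMap.newKeySet();
    // Serving threads registered for a particular job's file, keyed by job id.
    private final Map<String, CompletableFuture<Void>> waiters = new ConcurrentHashMap<>();

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition wakeUp = lock.newCondition();
    private boolean scanRequested = false;   // guarded by lock
    private volatile boolean running = true;
    private final long sleepMillis;
    private final Thread mover;

    SingleMoverSketch(long sleepMillis) {
        this.sleepMillis = sleepMillis;
        this.mover = new Thread(this::moveLoop, "intermediate-to-done-mover");
        this.mover.setDaemon(true);
    }

    void start() { mover.start(); }

    void stop() throws InterruptedException {
        running = false;
        requestScan();   // wake the mover so it notices the flag
        mover.join();
    }

    /** Simulates a finished job writing its history file to intermediate. */
    void addIntermediate(String jobId) { intermediate.add(jobId); }

    /** Called by serving threads; true once the file is in done, false on timeout. */
    boolean awaitDone(String jobId, long timeoutMillis) throws InterruptedException {
        if (done.contains(jobId)) return true;   // look in done first
        CompletableFuture<Void> f =
            waiters.computeIfAbsent(jobId, k -> new CompletableFuture<>());
        // Re-check after registering, in case the move finished in between.
        if (done.contains(jobId)) {
            f.complete(null);
            waiters.remove(jobId, f);
            return true;
        }
        requestScan();
        try {
            f.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;   // the stale registration is left behind in this sketch
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    private void requestScan() {
        lock.lock();
        try {
            scanRequested = true;
            wakeUp.signal();
        } finally {
            lock.unlock();
        }
    }

    private void moveLoop() {
        while (running) {
            lock.lock();
            try {
                // Sleep between scans unless a serving thread already asked for one.
                if (!scanRequested) wakeUp.await(sleepMillis, TimeUnit.MILLISECONDS);
                scanRequested = false;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } finally {
                lock.unlock();
            }
            // Only this thread moves files, so no per-user-directory lock is needed.
            for (String jobId : intermediate) {
                done.add(jobId);
                intermediate.remove(jobId);
                CompletableFuture<Void> f = waiters.remove(jobId);
                if (f != null) f.complete(null);   // notify every thread waiting on this job
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SingleMoverSketch m = new SingleMoverSketch(1000);
        m.start();
        m.addIntermediate("job_1");
        System.out.println(m.awaitDone("job_1", 5000));
        m.stop();
    }
}
```

In this sketch a request for a job that never shows up in intermediate simply
times out, and the periodic sleep still covers files that arrive without any
request.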
[~jlowe], [~revans2] - what do you think?
> High contention on scanning of user directory under immediate_done in Job
> History Server
> ----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.7.0
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Attachments: jhs-jstacks-service-monitor-running.tar.gz,
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each
> user directory it scans (moving or deleting files under the user directory
> as necessary). This method is called by a thread in JobHistory that
> periodically scans the intermediate directory, and it can also be called by
> web server threads for each Web API call made by a JHS client. When there
> are many concurrent Web API calls/connections to JHS, all but one thread
> block on the user directory lock. Eventually, client connections time out,
> but the blocked threads in JHS are not killed, leaving many TCP connections
> in the CLOSE_WAIT state.
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp 0 0 10.17.202.19:10020 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.198.30:33010
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.200.30:33980
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.10:59625
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.13:35765
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10033 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:35103
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:43670
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:45453
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:49184
> ESTABLISHED 63729/java
> tcp 1 0 10.17.202.19:19888 10.17.202.13:49992
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52703
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52707
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52708
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52710
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52714
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52723
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52726
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52727
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52739
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52749
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52753
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52757
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52760
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52820
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52827
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52829
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52831
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52833
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52836
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52839
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52841
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52843
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52850
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52860
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52876
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52879
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52881
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52884
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52886
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52888
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52891
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52893
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52896
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52898
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52899
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52902
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52909
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52910
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52912
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52923
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52925
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52927
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52930
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52937
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52939
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52945
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52947
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52969
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52972
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52975
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53004
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53007
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53009
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53011
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53052
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53058
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53059
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53063
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53071
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53084
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53093
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53095
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53097
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53101
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53104
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53106
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53108
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53110
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53112
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53114
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53115
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53117
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53121
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53123
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53125
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53127
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53129
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53131
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53134
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53138
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53140
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53153
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53155
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53157
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53159
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53173
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53176
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53177
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53178
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53179
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53181
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53183
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53201
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53204
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53218
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53267
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53270
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53275
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53278
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53280
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53283
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53293
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53296
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53299
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53309
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53312
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53314
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53317
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53320
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53322
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53338
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53340
> CLOSE_WAIT 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53364
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53366
> ESTABLISHED 63729/java
> tcp 260 0 10.17.202.19:19888 10.17.202.13:53367
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53380
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53382
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53386
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53390
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53392
> ESTABLISHED 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45301
> CLOSE_WAIT 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45303
> CLOSE_WAIT 63729/java
> tcp 1277 0 10.17.202.19:19888 10.17.202.18:45306
> ESTABLISHED 63729/java
> {noformat}