[
https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254314#comment-15254314
]
Haibo Chen commented on MAPREDUCE-6684:
---------------------------------------
Thanks a lot for your insight on why the intermediate directory is scanned
before the done directory and on the potential NameNode issue, [~revans2]. That
makes a lot of sense. Per offline discussion with [[email protected]], we'd
like to propose three approaches.
1. For web API requests for individual jobs, the intermediate directory is
still scanned first, but inside scanIntermediateDir() we could check whether
the jhist files of the associated job exist, and move files from the
intermediate directory to the done directory only when they do. The
assumption is that a file existence check is not expensive, and that if the
files do not exist in the intermediate directory, we hold the lock on the user
directory only for a short period of time.
2. For web API requests for individual jobs, when the intermediate directory is
scanned, check the existence of the job files, and move only the files of the
job associated with the request from the intermediate directory to the done
directory. This reduces the time for which each job web request thread blocks,
but may have much lower overall throughput than the previous approach, where
files are moved in batch.
3. Have a dedicated thread scan the intermediate directory and have other
threads wait on a monitor associated with a particular job. When the dedicated
thread finishes, threads waiting on the monitors are notified. Having a
single writer reduces contention on the user directory lock. But it
does have the problem of conflicting with clients' expectations, as [~revans2]
pointed out in a previous comment.
Can you please share some of your thoughts on them, [~revans2], [~jlowe]?
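To make approaches 1 and 2 concrete, here is a rough sketch of the idea, not a
patch. It assumes a local file system accessed through java.nio rather than the
Hadoop FileSystem API, a single lock standing in for the per-user-directory
lock, and job files that can be found by a job ID prefix. The class name
IntermediateScanSketch, the method scanForJob() and the jobOnly flag are all
hypothetical and do not exist in HistoryFileManager.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the JHS scan path. It works on local directories
// via java.nio; the real server would go through the Hadoop FileSystem API.
class IntermediateScanSketch {
    // Assumption: one lock represents the lock on a single user directory.
    private final ReentrantLock userDirLock = new ReentrantLock();
    private final Path intermediateUserDir;
    private final Path doneDir;

    IntermediateScanSketch(Path intermediateUserDir, Path doneDir) {
        this.intermediateUserDir = intermediateUserDir;
        this.doneDir = doneDir;
    }

    /**
     * Moves files to the done directory only if the requested job still has
     * files in the intermediate directory. Returns the number of files moved.
     * jobOnly = false moves everything in batch (approach 1);
     * jobOnly = true moves only the requested job's files (approach 2).
     */
    int scanForJob(String jobId, boolean jobOnly) throws IOException {
        // Existence check performed without holding the lock.
        if (list(jobId + "*").isEmpty()) {
            return 0;
        }
        userDirLock.lock();
        try {
            // List again under the lock; another thread may have moved files.
            List<Path> toMove = list(jobOnly ? jobId + "*" : "*");
            for (Path file : toMove) {
                Files.move(file, doneDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            return toMove.size();
        } finally {
            userDirLock.unlock();
        }
    }

    // Collects the directory entries matching the glob into a list so that
    // files are not moved while the directory stream is still open.
    private List<Path> list(String glob) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(intermediateUserDir, glob)) {
            for (Path file : files) {
                result.add(file);
            }
        }
        return result;
    }
}
```

In this sketch a request for a job that has no files in the intermediate
directory returns without ever taking the lock, which is the case we expect to
be common for web requests. Whether the extra existence check is cheap enough
against the NameNode is exactly the assumption stated in approach 1.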
> High contention on scanning of user directory under immediate_done in Job
> History Server
> ----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.7.0
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Attachments: jhs-jstacks-service-monitor-running.tar.gz,
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each
> user directory it tries to scan (moving or deleting files under the user
> directory as necessary). This method is called in a thread in JobHistory that
> performs periodic scanning of the intermediate directory, and can also be
> called by web server threads for each Web API call made by a JHS client. In
> cases where there are many concurrent Web API calls/connections to JHS, all
> but one thread are blocked on the lock on the user directory. Eventually,
> client connections will time out, but the threads in JHS will not be killed
> and will leave a lot of TCP connections in CLOSE_WAIT state.
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp 0 0 10.17.202.19:10020 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.198.30:33010
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.200.30:33980
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.10:59625
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.13:35765
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10033 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:35103
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:43670
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:45453
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:49184
> ESTABLISHED 63729/java
> tcp 1 0 10.17.202.19:19888 10.17.202.13:49992
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52703
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52707
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52708
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52710
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52714
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52723
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52726
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52727
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52739
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52749
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52753
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52757
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52760
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52820
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52827
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52829
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52831
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52833
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52836
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52839
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52841
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52843
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52850
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52860
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52876
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52879
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52881
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52884
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52886
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52888
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52891
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52893
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52896
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52898
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52899
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52902
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52909
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52910
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52912
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52923
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52925
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52927
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52930
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52937
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52939
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52945
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52947
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52969
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52972
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52975
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53004
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53007
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53009
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53011
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53052
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53058
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53059
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53063
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53071
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53084
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53093
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53095
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53097
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53101
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53104
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53106
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53108
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53110
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53112
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53114
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53115
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53117
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53121
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53123
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53125
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53127
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53129
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53131
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53134
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53138
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53140
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53153
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53155
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53157
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53159
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53173
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53176
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53177
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53178
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53179
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53181
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53183
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53201
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53204
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53218
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53267
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53270
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53275
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53278
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53280
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53283
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53293
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53296
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53299
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53309
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53312
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53314
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53317
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53320
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53322
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53338
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53340
> CLOSE_WAIT 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53364
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53366
> ESTABLISHED 63729/java
> tcp 260 0 10.17.202.19:19888 10.17.202.13:53367
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53380
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53382
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53386
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53390
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53392
> ESTABLISHED 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45301
> CLOSE_WAIT 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45303
> CLOSE_WAIT 63729/java
> tcp 1277 0 10.17.202.19:19888 10.17.202.18:45306
> ESTABLISHED 63729/java
> {noformat}