[
https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251961#comment-15251961
]
Robert Joseph Evans commented on MAPREDUCE-6684:
------------------------------------------------
It has been a while since I did anything with this (2013 from the edit logs).
As such I don't remember too much about the code so feel free to take my advice
or leave it, as I could be wrong. Now that I finished with the all the
disclaimers I believe that the reason we scan the intermediate directory first
is to not drop any files that might be in the middle of being moved from
intermediate_done to done. So if you look in the older directory first you
might miss some files that are in the middle of being moved, and then when you
scan intermediate_done you wouldn't see those files because they have already
been moved. So if you think there is a high probability that you will find the
files you want in the done dir then you would need to
scan the done dir
if not found scan intermediate_done
//Because something might have been in flight
if still not found scan the done dir again.
>From our experience the majority of the hits to the history server for a
>job/application happen right after the application finishes. So at least for
>our usage pattern this would not help much, and would probably add a lot more
>load to the namenode, something that we are much more concerned about.
Having the scanner be on a background thread would help, but you would need
some synchronization that is a bit more complex then just a waking the thread
up and then waiting for a notification that it was done. From the clients
perspective if a history file was placed in the intermediate_done dir and then
they made a call to the history server it had better pick up that file and
return a result. So we need to be sure that the waiting thread will only be
woken up in the case where nothing was found if the scan started after the
request was made.
> High contention on scanning of user directory under immediate_done in Job
> History Server
> ----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6684
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.7.0
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Attachments: jhs-jstacks-service-monitor-running.tar.gz,
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each
> user directory it tries to scan (move or delete files under the user
> directory as necessary). This method is called in a thread in JobHistory that
> performs periodical scanning of intermediate directory, and can also be
> called by web server threads for each Web API call made by a JHS client. In
> cases where there are many concurrent Web API calls/connections to JHS, all
> but one thread are blocked on the lock on the user directory. Eventually,
> client connects will time out, but the threads in JHS will not be killed and
> leave a lot of TCP connections in CLOSE_WAIT state.
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp 0 0 10.17.202.19:10020 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.198.30:33010
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.200.30:33980
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.10:59625
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10020 10.17.202.13:35765
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:10033 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 0.0.0.0:*
> LISTEN 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:35103
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:43670
> ESTABLISHED 63729/java
> tcp 0 0 10.17.202.19:19888 10.17.198.30:45453
> ESTABLISHED 63729/java
> tcp 277 0 10.17.202.19:19888 10.17.198.30:49184
> ESTABLISHED 63729/java
> tcp 1 0 10.17.202.19:19888 10.17.202.13:49992
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52703
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52707
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52708
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52710
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52714
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52723
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52726
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52727
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52739
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52749
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52753
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52757
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52760
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52820
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52827
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52829
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52831
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52833
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52836
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52839
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52841
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52843
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52850
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52860
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52876
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52879
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52881
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52884
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52886
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52888
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52891
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52893
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52896
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52898
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52899
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52902
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52909
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52910
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52912
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52923
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52925
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52927
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52930
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52937
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52939
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52945
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52947
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52969
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:52972
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:52975
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53004
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53007
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53009
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53011
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53052
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53058
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53059
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53063
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53071
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53084
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53093
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53095
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53097
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53101
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53104
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53106
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53108
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53110
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53112
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53114
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53115
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53117
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53121
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53123
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53125
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53127
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53129
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53131
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53134
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53138
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53140
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53153
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53155
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53157
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53159
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53173
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53176
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53177
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53178
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53179
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53181
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53183
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53201
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53204
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53218
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53267
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53270
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53275
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53278
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53280
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53283
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53293
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53296
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53299
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53309
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53312
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53314
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53317
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53320
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53322
> CLOSE_WAIT 63729/java
> tcp 256 0 10.17.202.19:19888 10.17.202.13:53338
> CLOSE_WAIT 63729/java
> tcp 261 0 10.17.202.19:19888 10.17.202.13:53340
> CLOSE_WAIT 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53364
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53366
> ESTABLISHED 63729/java
> tcp 260 0 10.17.202.19:19888 10.17.202.13:53367
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53380
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53382
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53386
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53390
> ESTABLISHED 63729/java
> tcp 255 0 10.17.202.19:19888 10.17.202.13:53392
> ESTABLISHED 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45301
> CLOSE_WAIT 63729/java
> tcp 1278 0 10.17.202.19:19888 10.17.202.18:45303
> CLOSE_WAIT 63729/java
> tcp 1277 0 10.17.202.19:19888 10.17.202.18:45306
> ESTABLISHED 63729/java
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)