[
https://issues.apache.org/jira/browse/TEZ-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877721#comment-15877721
]
Jonathan Eagles commented on TEZ-3633:
--------------------------------------
[~rajesh.balamohan], I used the above logging technique to compare
HttpURLConnection to what is happening at the network level. Below is a
HttpURLConnection log and below it is a wireshark capture. You can see that the
HttpURLLogging suggests that keep alive connections are being used but in
reality the remote side issues a reset (RST) FIN and FIN, ACK were removed for
brevity. You can see that when connection is attempted to be reused, the server
resets the connection and the fetcher thread reattempts the request on a new
connection. You can see the reattempts are new connections since they happen on
a different port with the exactly the same request. The job is a 14:1:1
OrderedWordCount job. My guess is that HttpURLConnection sees the keep-alive
response header and is re-connecting silently behind the scenes.
{noformat:title=HttpURLConnection logging}
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true
FINEST: KeepAlive stream retrieved from the cache,
sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true)
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true
FINEST: KeepAlive stream retrieved from the cache,
sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true)
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true
FINEST: KeepAlive stream retrieved from the cache,
sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true)
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true
FINEST: KeepAlive stream retrieved from the cache,
sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true)
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true
FINEST: KeepAlive stream retrieved from the cache,
sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true)
FINEST: KeepAlive stream used:
http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true
{noformat}
{noformat:title=Wireshark capture}
No Time srcaddr srcport dstaddr dstport
protocol length info
10665 19.595725 127.0.0.1 49670 127.0.0.1 13563 HTTP
446 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true
HTTP/1.1
11726 19.697864 127.0.0.1 49670 127.0.0.1 13563 HTTP
587 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true
HTTP/1.1
11727 19.697939 127.0.0.1 13563 127.0.0.1 49670 TCP
44 13563 → 49670 [RST] Seq=17506303 Win=0 Len=0
11736 19.700396 127.0.0.1 49671 127.0.0.1 13563 HTTP
587 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true
HTTP/1.1
15865 21.339273 127.0.0.1 49671 127.0.0.1 13563 HTTP
399 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true
HTTP/1.1
15866 21.339350 127.0.0.1 13563 127.0.0.1 49671 TCP
44 13563 → 49671 [RST] Seq=44667350 Win=0 Len=0
15871 21.340217 127.0.0.1 49672 127.0.0.1 13563 HTTP
399 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true
HTTP/1.1
16330 21.652935 127.0.0.1 13563 127.0.0.1 49672 TCP
44 13563 → 49672 [RST] Seq=4997533 Win=0 Len=0
16339 21.653976 127.0.0.1 49675 127.0.0.1 13563 HTTP
399 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true
HTTP/1.1
16949 21.963803 127.0.0.1 49675 127.0.0.1 13563 HTTP
446 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true
HTTP/1.1
16950 21.963865 127.0.0.1 13563 127.0.0.1 49675 TCP
44 13563 → 49675 [RST] Seq=7275696 Win=0 Len=0
16959 21.964488 127.0.0.1 49678 127.0.0.1 13563 HTTP
446 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true
HTTP/1.1
17622 21.988410 127.0.0.1 49678 127.0.0.1 13563 HTTP
399 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true
HTTP/1.1
17623 21.988474 127.0.0.1 13563 127.0.0.1 49678 TCP
44 13563 → 49678 [RST] Seq=17117477 Win=0 Len=0
17628 21.989139 127.0.0.1 49679 127.0.0.1 13563 HTTP
399 GET
/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true
HTTP/1.1
{noformat}
> Implement keep-alive timeout in tez shuffle handler
> ---------------------------------------------------
>
> Key: TEZ-3633
> URL: https://issues.apache.org/jira/browse/TEZ-3633
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Attachments: TEZ-3633.1.patch
>
>
> MAPREDUCE-5787 which added keep-alive to mapreduce shuffle handler was not
> fully functional as despite advertising keep-alive option and adding the
> header to the response, all connections were closed immediately after write.
> This reduced the performance of certain fetches as now time is spent
> requesting a second get to the same serve, only for that server to reset the
> connection forcing the client to reestablish the connection on another port.
> The details of this is hidden behind HttpURLConnection and doesn't show in
> any log file at default logging level. However TCP sniffing does show errant
> behavior.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)