[ https://issues.apache.org/jira/browse/TEZ-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877721#comment-15877721 ]
Jonathan Eagles edited comment on TEZ-3633 at 2/22/17 4:17 PM: --------------------------------------------------------------- [~rajesh.balamohan], I used the above logging technique to compare HttpURLConnection to what is happening at the network level. Below is a HttpURLConnection log and below it is a wireshark capture. You can see that the HttpURLLogging suggests that keep alive connections are being used but in reality the remote side issues a reset (RST) FIN and FIN, ACK were removed for brevity. You can see that when connection is attempted to be reused, the server resets the connection and the fetcher thread reattempts the request on a new connection. You can see the reattempts are new connections since they happen on a different port with the exactly the same request. The job is a 14:1:1 OrderedWordCount job. My guess is that HttpURLConnection sees the keep-alive response header and is re-connecting silently behind the scenes. Do you have a way to get a network sniff of the above job to see if you are seeing the same behavior? {noformat:title=HttpURLConnection logging} FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true {noformat} {noformat:title=Wireshark capture} No Time srcaddr srcport dstaddr dstport protocol length info 10665 19.595725 127.0.0.1 49670 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true HTTP/1.1 11726 19.697864 127.0.0.1 49670 127.0.0.1 13563 HTTP 587 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true HTTP/1.1 11727 19.697939 127.0.0.1 13563 127.0.0.1 49670 TCP 44 13563 → 49670 [RST] Seq=17506303 Win=0 Len=0 11736 19.700396 127.0.0.1 49671 127.0.0.1 13563 HTTP 587 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true HTTP/1.1 13758 19.886892 127.0.0.1 13563 127.0.0.1 49671 TCP 56 13563 → 49671 [FIN, ACK] Seq=44667349 Ack=532 Win=407744 Len=0 TSval=881231323 TSecr=881231323 15865 21.339273 127.0.0.1 49671 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true HTTP/1.1 15866 21.339350 127.0.0.1 13563 127.0.0.1 49671 TCP 44 13563 → 49671 [RST] Seq=44667350 Win=0 Len=0 15871 21.340217 127.0.0.1 49672 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true HTTP/1.1 16330 21.652935 127.0.0.1 13563 127.0.0.1 49672 TCP 44 13563 → 49672 [RST] Seq=4997533 Win=0 Len=0 16339 21.653976 127.0.0.1 49675 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true HTTP/1.1 16949 21.963803 127.0.0.1 49675 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true HTTP/1.1 16950 21.963865 127.0.0.1 13563 127.0.0.1 49675 TCP 44 13563 → 49675 [RST] Seq=7275696 Win=0 Len=0 16959 21.964488 127.0.0.1 49678 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true HTTP/1.1 17617 21.982604 127.0.0.1 13563 127.0.0.1 49678 TCP 56 13563 → 49678 [FIN, ACK] Seq=17117476 Ack=391 Win=407904 Len=0 TSval=881233239 TSecr=881233238 17622 21.988410 127.0.0.1 49678 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true HTTP/1.1 17623 21.988474 127.0.0.1 13563 127.0.0.1 49678 TCP 44 13563 → 49678 [RST] Seq=17117477 Win=0 Len=0 17628 21.989139 127.0.0.1 49679 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true HTTP/1.1 18080 21.998431 127.0.0.1 13563 127.0.0.1 49679 TCP 56 13563 → 49679 [FIN, ACK] Seq=8188579 Ack=344 Win=407936 Len=0 TSval=881233253 TSecr=881233253 {noformat} was (Author: jeagles): [~rajesh.balamohan], I used the above logging technique to compare HttpURLConnection to what is happening at the network level. Below is a HttpURLConnection log and below it is a wireshark capture. You can see that the HttpURLLogging suggests that keep alive connections are being used but in reality the remote side issues a reset (RST) FIN and FIN, ACK were removed for brevity. You can see that when connection is attempted to be reused, the server resets the connection and the fetcher thread reattempts the request on a new connection. You can see the reattempts are new connections since they happen on a different port with the exactly the same request. The job is a 14:1:1 OrderedWordCount job. My guess is that HttpURLConnection sees the keep-alive response header and is re-connecting silently behind the scenes. Do you have a way to get a network sniff of the above job to see if you are seeing the same behavior? {noformat:title=HttpURLConnection logging} FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true FINEST: KeepAlive stream retrieved from the cache, sun.net.www.http.HttpClient(http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true) FINEST: KeepAlive stream used: http://localhost:13563/mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true {noformat} {noformat:title=Wireshark capture} No Time srcaddr srcport dstaddr dstport protocol length info 10665 19.595725 127.0.0.1 49670 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000004_0_10003,attempt_1487743755036_0018_1_00_000005_0_10003&keepAlive=true HTTP/1.1 11726 19.697864 127.0.0.1 49670 127.0.0.1 13563 HTTP 587 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true HTTP/1.1 11727 19.697939 127.0.0.1 13563 127.0.0.1 49670 TCP 44 13563 → 49670 [RST] Seq=17506303 Win=0 Len=0 11736 19.700396 127.0.0.1 49671 127.0.0.1 13563 HTTP 587 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000002_0_10003,attempt_1487743755036_0018_1_00_000000_0_10003,attempt_1487743755036_0018_1_00_000003_0_10003,attempt_1487743755036_0018_1_00_000006_0_10003,attempt_1487743755036_0018_1_00_000001_0_10003&keepAlive=true HTTP/1.1 15865 21.339273 127.0.0.1 49671 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true HTTP/1.1 15866 21.339350 127.0.0.1 13563 127.0.0.1 49671 TCP 44 13563 → 49671 [RST] Seq=44667350 Win=0 Len=0 15871 21.340217 127.0.0.1 49672 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000011_0_10006&keepAlive=true HTTP/1.1 16330 21.652935 127.0.0.1 13563 127.0.0.1 49672 TCP 44 13563 → 49672 [RST] Seq=4997533 Win=0 Len=0 16339 21.653976 127.0.0.1 49675 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000010_0_10005&keepAlive=true HTTP/1.1 16949 21.963803 127.0.0.1 49675 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true HTTP/1.1 16950 21.963865 127.0.0.1 13563 127.0.0.1 49675 TCP 44 13563 → 49675 [RST] Seq=7275696 Win=0 Len=0 16959 21.964488 127.0.0.1 49678 127.0.0.1 13563 HTTP 446 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000007_0_10006,attempt_1487743755036_0018_1_00_000008_0_10006&keepAlive=true HTTP/1.1 17622 21.988410 127.0.0.1 49678 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true HTTP/1.1 17623 21.988474 127.0.0.1 13563 127.0.0.1 49678 TCP 44 13563 → 49678 [RST] Seq=17117477 Win=0 Len=0 17628 21.989139 127.0.0.1 49679 127.0.0.1 13563 HTTP 399 GET /mapOutput?job=job_1487743755036_0018&dag=1&reduce=0&map=attempt_1487743755036_0018_1_00_000009_0_10006&keepAlive=true HTTP/1.1 {noformat} > Implement keep-alive timeout in tez shuffle handler > --------------------------------------------------- > > Key: TEZ-3633 > URL: https://issues.apache.org/jira/browse/TEZ-3633 > Project: Apache Tez > Issue Type: Sub-task > Reporter: Jonathan Eagles > Assignee: Jonathan Eagles > Attachments: TEZ-3633.1.patch > > > MAPREDUCE-5787 which added keep-alive to mapreduce shuffle handler was not > fully functional as despite advertising keep-alive option and adding the > header to the response, all connections were closed immediately after write. > This reduced the performance of certain fetches as now time is spent > requesting a second get to the same serve, only for that server to reset the > connection forcing the client to reestablish the connection on another port. > The details of this is hidden behind HttpURLConnection and doesn't show in > any log file at default logging level. However TCP sniffing does show errant > behavior. -- This message was sent by Atlassian JIRA (v6.3.15#6346)