[ https://issues.apache.org/jira/browse/TEZ-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4357:
------------------------------
Description:
Currently, when Fetcher and FetcherOrderedGrouped fail on getInputStream, like:
{code}
2021-12-03 08:32:04,634 [WARN] [Fetcher_O {expDataFile} #11] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting from hwc7213-6.hwc7213.root.hwx.site to hwc7213-7.hwc7213.root.hwx.site:13562 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:362)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
{code}
they don't report the URL, which is important for understanding the failure.
I'm investigating a shuffle failure using INFO level logs, and on failure the
URL itself is not printed. However, I can see in the ShuffleHandler logs that
the request is not a valid SSL request, which makes me think that the fetcher
is simply working with invalid settings. If I could see at least the base URL,
I could verify that it was trying to connect securely (protocol: https).
Logging isSecureShuffle could also be useful, but the URL already contains all
the information we need in case of a failure.
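
A minimal sketch of the idea, not the actual Tez code: the class name, the
describeFailure helper, and the example URL path are all assumptions made up
for illustration. It only shows how appending the URL and its protocol to the
existing warning would make a plain-http request against an SSL-enabled
ShuffleHandler visible at a glance.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.URI;

// Illustrative sketch only; not the actual Tez fetcher code.
final class FetchFailureLogSketch {

    // Hypothetical helper: builds the warning text and appends the URL and its protocol.
    public static String describeFailure(String localHost, String remoteHost,
                                         String url, int pendingInputs) {
        String protocol = URI.create(url).getScheme();
        return "Failed to verify reply after connecting from " + localHost
            + " to " + remoteHost
            + " with " + pendingInputs + " inputs pending"
            + ", url=" + url
            + ", protocol=" + protocol;
    }

    public static void main(String[] args) {
        // Example value only; the real shuffle URL is built by the fetcher.
        String url = "http://hwc7213-7.hwc7213.root.hwx.site:13562/mapOutput";
        try {
            // Stand-in for the getInputStream call that timed out in the log above.
            throw new SocketTimeoutException("Read timed out");
        } catch (IOException e) {
            System.err.println(describeFailure("hwc7213-6.hwc7213.root.hwx.site",
                "hwc7213-7.hwc7213.root.hwx.site:13562", url, 1));
            e.printStackTrace();
        }
    }
}
```

With a message like this, an "http" protocol in the fetcher log next to an SSL
error in the ShuffleHandler log would point directly at a configuration
mismatch.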
was:
Currently, when Fetcher and FetcherOrderedGrouped fail on getInputStream, like:
{code}
2021-12-03 08:32:04,634 [WARN] [Fetcher_O {expDataFile} #11] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting from hwc7213-6.hwc7213.root.hwx.site to hwc7213-7.hwc7213.root.hwx.site:13562 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:362)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
{code}
they don't report the full URL, which is important for understanding the
failure. I'm investigating a shuffle failure using INFO level logs, and on
failure the URL itself is not printed. However, I can see in the
ShuffleHandler logs that the request is not a valid SSL request, which makes me
think that the fetcher is simply working with invalid settings. If I could see
the full URL, I could verify that it was trying to connect securely
(protocol: https). Logging isSecureShuffle could also be useful, but the full
URL already contains all the information we need in case of a failure.
> Report url to logs in case of fetcher connection failure
> --------------------------------------------------------
>
> Key: TEZ-4357
> URL: https://issues.apache.org/jira/browse/TEZ-4357
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
>