[ https://issues.apache.org/jira/browse/TEZ-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4097:
------------------------------
Description:
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch
Failure from host while connecting: other_host, attempt: InputAttemptIdentifier
[inputIdentifier=1, attemptNumber=0,
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0,
spillId=-1] Informing ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}
When debugging network, SSL, or similar issues on a cluster, it would be
convenient to see the local host's name in these messages (it is available in
the fetcher as the localHostname property), because in logs collected with the
YARN CLI it is not obvious at first sight which host a message came from.
The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0]
|orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting
to other_host:13562 with 1 inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException:
PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target
{code}
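As a minimal sketch of what the extended messages could look like: the helper class, method names, and exact wording below are hypothetical and not taken from the Tez code base. The only assumption carried over from the fetcher is that a localHostname string is available when the failure is logged.

```java
// Hypothetical helper, not part of Tez: shows one way the local host name
// could be included in the two fetch-failure messages quoted above.
final class FetchFailureMessages {

    private FetchFailureMessages() {
    }

    // Message for a failure while connecting to a remote shuffle host.
    // localHostname is assumed to be the value the fetcher already holds.
    static String connectFailure(String localHostname, String remoteHost, String attempt) {
        return "Fetch Failure from host while connecting: " + remoteHost
            + ", local host: " + localHostname
            + ", attempt: " + attempt
            + " Informing ShuffleManager: ";
    }

    // Message for a reply-verification failure in the ordered-grouped fetcher.
    static String verifyReplyFailure(String localHostname, String remoteHostAndPort,
                                     int pendingInputs) {
        return "Failed to verify reply after connecting to " + remoteHostAndPort
            + " from " + localHostname
            + " with " + pendingInputs + " inputs pending";
    }

    public static void main(String[] args) {
        // Placeholder host names, matching the anonymized names in the logs above.
        System.out.println(connectFailure("local_host", "other_host",
            "InputAttemptIdentifier [...]"));
        System.out.println(verifyReplyFailure("local_host", "other_host:13562", 1));
    }
}
```

With both ends of the connection in one line, a message pulled out of aggregated logs can be matched to a host pair without looking up which container ran where.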
was:
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch
Failure from host while connecting: *other_host*, attempt:
InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0,
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0,
spillId=-1] Informing ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}
For debugging network/ssl/etc. issues on cluster, it would be convenient to see
the local host's name in these messages (which is present in the fetcher as
localHostname property), as in the logs collected by yarn cli, it's not obvious
for the first sight.
The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0]
|orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting
to rizhangdebug10-2.gce.cloudera.com:13562 with 1 inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException:
PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target
{code}
> Report localHostname in Fetcher failure log messages
> ----------------------------------------------------
>
> Key: TEZ-4097
> URL: https://issues.apache.org/jira/browse/TEZ-4097
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Minor
>
> Currently, a fetch failure is reported like this:
> {code}
> 2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|:
> Fetch Failure from host while connecting: other_host, attempt:
> InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0,
> pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0,
> spillId=-1] Informing ShuffleManager:
> java.net.SocketTimeoutException: Read timed out
> ...
> {code}
> When debugging network, SSL, or similar issues on a cluster, it would be
> convenient to see the local host's name in these messages (it is available in
> the fetcher as the localHostname property), because in logs collected with the
> YARN CLI it is not obvious at first sight which host a message came from.
> The same applies to FetcherOrderedGrouped, which reports something like:
> {code}
> 2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0]
> |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after
> connecting to other_host:13562 with 1 inputs pending
> javax.net.ssl.SSLHandshakeException:
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target
> {code}