[ 
https://issues.apache.org/jira/browse/TEZ-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4097:
------------------------------
    Description: 
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch 
Failure from host while connecting: other_host, attempt: InputAttemptIdentifier 
[inputIdentifier=1, attemptNumber=0, 
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0, 
spillId=-1] Informing ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}

When debugging network, SSL, and similar issues on a cluster, it would be 
convenient to see the local host's name in these messages (it is available in 
the fetcher as the localHostname property), because it is not obvious at first 
sight in the logs collected by the YARN CLI.

The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0] 
|orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting 
to other_host:13562 with 1 inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: 
PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target
{code}
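
One possible shape for the improved messages is sketched below. This is only an 
illustration, not the actual Tez code: the class FetchFailureMessages, its 
method names, and the exact wording "from <local> to <remote>" are assumptions 
made for this example. The only thing taken from the fetcher is that the local 
host's name is available as localHostname.

```java
// Hypothetical helper, NOT part of Tez: it only illustrates how the local
// host's name could be added to the two failure messages quoted above.
final class FetchFailureMessages {

  private FetchFailureMessages() {
  }

  /** Message for a failure while connecting, with the local host added. */
  static String connectFailure(String localHostname, String remoteHost,
      String attempt) {
    return "Fetch Failure while connecting from " + localHostname
        + " to: " + remoteHost + ", attempt: " + attempt
        + " Informing ShuffleManager:";
  }

  /** Message for a reply verification failure, with the local host added. */
  static String verifyReplyFailure(String localHostname, String remoteHost,
      int port, int pendingInputs) {
    return "Failed to verify reply after connecting from " + localHostname
        + " to " + remoteHost + ":" + port
        + " with " + pendingInputs + " inputs pending";
  }

  public static void main(String[] args) {
    // "local_host" and "other_host" are placeholders, as in the logs above.
    System.out.println(connectFailure("local_host", "other_host",
        "InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0]"));
    System.out.println(verifyReplyFailure("local_host", "other_host", 13562, 1));
  }
}
```

With both host names in one line, a message that turns up in aggregated logs 
identifies both ends of the failing connection without having to look up which 
node the container ran on.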

  was:
Currently, a fetch failure is reported like this:
{code}
2019-11-05 02:50:35,972 [WARN] [Fetcher_B {Map_4} #1] |shuffle.Fetcher|: Fetch 
Failure from host while connecting: *other_host*, attempt: 
InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0, 
pathComponent=attempt_1572936153637_0005_1_00_000000_0_10003, spillType=0, 
spillId=-1] Informing ShuffleManager:
java.net.SocketTimeoutException: Read timed out
...
{code}

When debugging network, SSL, and similar issues on a cluster, it would be 
convenient to see the local host's name in these messages (it is available in 
the fetcher as the localHostname property), because it is not obvious at first 
sight in the logs collected by the YARN CLI.

The same applies to FetcherOrderedGrouped, which reports something like:
{code}
2019-11-05 03:13:11,046 [WARN] [Fetcher_O {Map_1} #0] 
|orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting 
to rizhangdebug10-2.gce.cloudera.com:13562 with 1 inputs pending
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: 
PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target
{code}


> Report localHostname in Fetcher failure log messages
> ----------------------------------------------------
>
>                 Key: TEZ-4097
>                 URL: https://issues.apache.org/jira/browse/TEZ-4097
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
