[ https://issues.apache.org/jira/browse/TEZ-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4348:
------------------------------
Description:
The idea is the same as in TEZ-4336, but this ticket covers the unordered codepaths.
An example of the problem is attached as
[^org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt]
I discovered this while working on a Hive ticket:
1. a qtest failed
2. there was no obvious Hive-related error
3. the logs contained tons of messages like the one below:
{code}
2024-07-26T00:21:36,900 INFO [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Map_1 -> Reducer_2: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1] InputIdentifier: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1721978473743_0001_8_00_000000_0_10129, spillType=0, spillId=-1], connectFailed: true, local fetch: false, remote fetch failure reported as local failure: false)
{code}
4. after adding a log statement to ShuffleManager, I found the following:
{code}
2024-07-25T03:28:15,352 WARN [Fetcher_B {Map_1 -> Reducer_2} #0] impl.ShuffleManager: Fetch failure
java.io.IOException: Failed to connect to http://lbodor-MBP16.local:0/mapOutput?job=job_1721903278713_0001&dag=8&reduce=0&map=attempt_1721903278713_0001_8_00_000000_0_10129, #connectionFailures=1
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:166) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:121) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:505) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:574) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:493) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:291) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) ~[tez-common-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) ~[guava-28.2-jre.jar:?]
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) ~[guava-28.2-jre.jar:?]
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) ~[guava-28.2-jre.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: java.net.ConnectException: Can't assign requested address (connect failed)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:149) ~[tez-runtime-library-0.9.1.2024.0.19.0-SNAPSHOT.jar:0.9.1.2024.0.19.0-SNAPSHOT]
    ... 13 more
{code}
This eventually led to a DAG failure.
The expected behavior is:
1. log the original exception, and/or
2. report the exception to the AM so that the AM can report it on DAG failure
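To illustrate the two expected behaviors, here is a rough, self-contained sketch of logging the original exception and building a diagnostic string that carries the root cause. Everything in it is an assumption for illustration: FetchFailureDiagnostics, rootCause, describe and logAndDescribe are hypothetical names that do not exist in Tez, the URL in main is a placeholder, and java.util.logging stands in for whatever logger ShuffleManager actually uses. How the resulting string would be handed to the AM is not shown.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, NOT part of Tez: one possible way to keep the original
// exception visible when a fetch failure is logged and reported.
final class FetchFailureDiagnostics {
  private static final Logger LOG =
      Logger.getLogger(FetchFailureDiagnostics.class.getName());

  private FetchFailureDiagnostics() {}

  // Walks the cause chain and returns the innermost throwable.
  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    // The depth limit guards against cyclic cause chains.
    for (int depth = 0; current.getCause() != null && depth < 32; depth++) {
      current = current.getCause();
    }
    return current;
  }

  // Builds a one-line diagnostic naming both the wrapper and the root cause.
  static String describe(String source, Throwable t) {
    Throwable root = rootCause(t);
    StringBuilder sb = new StringBuilder();
    sb.append("Fetch failed for ").append(source).append(": ")
        .append(t.getClass().getName()).append(": ").append(t.getMessage());
    if (root != t) {
      sb.append(" (root cause: ").append(root.getClass().getName())
          .append(": ").append(root.getMessage()).append(')');
    }
    return sb.toString();
  }

  // Logs the full stack trace (behavior 1) and returns the diagnostic string
  // that a caller could forward in a failure report (behavior 2).
  static String logAndDescribe(String source, Throwable t) {
    String diagnostics = describe(source, t);
    LOG.log(Level.WARNING, diagnostics, t);
    return diagnostics;
  }

  public static void main(String[] args) {
    // Placeholder failure shaped like the one in the log above.
    IOException failure = new IOException(
        "Failed to connect to http://example.invalid:0/mapOutput",
        new ConnectException("Can't assign requested address (connect failed)"));
    System.out.println(logAndDescribe("Map_1 -> Reducer_2", failure));
  }
}
```

With something along these lines, the ConnectException would show up next to the "Fetch failed" message instead of only a generic connectFailed: true flag.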
> ShuffleManager should try to report the original exception
> ----------------------------------------------------------
>
> Key: TEZ-4348
> URL: https://issues.apache.org/jira/browse/TEZ-4348
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments:
> org.apache.hadoop.hive.cli.TestMiniLlapCliDriver-output.txt
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)