[jira] [Commented] (TEZ-3910) Single node can cause Tez job to fail during shuffle

Jonathan Turner Eagles (Jira) Mon, 30 Jun 2025 14:12:06 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987026#comment-17987026
 ]


Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------

This Jira has been getting more attention lately, so I went back to the origin 
of this failure to post a detailed scenario and logs that show what this Jira 
sets out to fix.

During a job failure analysis, a task failure was root-caused as follows. The 
first attempt of the failed task failed when an upstream node, referred to here 
as REBOOTED_NODE, was rebooted.

All subsequent attempts, scheduled on other nodes, failed while trying to pull 
outputs that had been produced on the same REBOOTED_NODE.

The other 100 or so tasks from the same vertex had already succeeded before 
REBOOTED_NODE went down, so the fetch failures happened only for this single 
task. I am not sure whether this explains why we were not able to blacklist the 
source (sender) instead of failing the receiver.
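
To make that hypothesis concrete, here is a toy sketch. It is not Tez's actual 
logic: the class, the method, and the 0.1 threshold are all hypothetical, 
chosen only to illustrate how a policy that blames a source based on the 
fraction of its consumers reporting fetch failures would never trip when a 
single late consumer is the only reporter.

```java
// Toy model only: names and thresholds are hypothetical, not Tez's actual logic.
public class FetchFailureSketch {

    /**
     * Hypothetical policy: blame the source only when the fraction of its
     * consumers reporting fetch failures reaches minReporterFraction.
     */
    public static boolean shouldBlameSource(int reportingConsumers,
                                            int totalConsumers,
                                            double minReporterFraction) {
        if (totalConsumers <= 0) {
            return false;
        }
        return (double) reportingConsumers / totalConsumers >= minReporterFraction;
    }

    public static void main(String[] args) {
        // Roughly 100 consumers finished before the reboot; one is still fetching.
        int totalConsumers = 101;
        int reportingConsumers = 1;
        double minReporterFraction = 0.1; // assumed value, for illustration only

        boolean blamed =
            shouldBlameSource(reportingConsumers, totalConsumers, minReporterFraction);
        System.out.println("source blamed: " + blamed);
    }
}
```

Under these assumed numbers the lone reporter is about 1% of the consumers, so 
the source is never blamed and the receiver keeps absorbing the failures 
itself.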

Sample stack trace:

{code:java}
2018-02-15 00:07:40,844 [WARN] [Fetcher_O {scope_752} #10] |orderedgrouped.FetcherOrderedGrouped|: Failed to connect to REBOOTED_NODE:13562 with 1 inputs
java.io.IOException: Failed to connect to http://REBOOTED_NODE:13562/mapOutput?job=*************, #connectionFailures=3
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:168)
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:123)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:343)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
        at sun.net.www.http.HttpClient.New(HttpClient.java:308)
        at sun.net.www.http.HttpClient.New(HttpClient.java:326)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
        at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:151)
        ... 11 more
{code}


> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
>                 Key: TEZ-3910
>                 URL: https://issues.apache.org/jira/browse/TEZ-3910
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch, 
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures 
> due to bad output from the upstream task can continue to blame itself for the 
> failure before the AM can do a re-run of the upstream offending task and fix 
> the fetch failure. This causes the DAG to fail even if a single node fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
