[ https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987026#comment-17987026 ]
Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------

This jira has been getting more attention, so I went back to the origin of the failure to post a detailed scenario and logs that show what this jira sets out to fix.

During a job failure analysis, a task failure was root-caused as follows. The first attempt of the task failed when an upstream node, the REBOOTED_NODE in question, was rebooted. All subsequent attempts, scheduled on other nodes, failed while trying to pull outputs produced on the same REBOOTED_NODE.

The other 100 or so tasks from the same vertex had already succeeded before REBOOTED_NODE went down, so the fetch failures happened only for this single task. Not sure if this explains why we were not able to blacklist the source (sender) instead of failing the receiver.

Sample stack trace:
{code:java}
2018-02-15 00:07:40,844 [WARN] [Fetcher_O {scope_752} #10] |orderedgrouped.FetcherOrderedGrouped|: Failed to connect to REBOOTED_NODE:13562 with 1 inputs
java.io.IOException: Failed to connect to http://REBOOTED_NODE:13562/mapOutput?job=*************, #connectionFailures=3
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:168)
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:123)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:343)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:151)
    ... 11 more
{code}

> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
>                 Key: TEZ-3910
>                 URL: https://issues.apache.org/jira/browse/TEZ-3910
>             Project: Apache Tez
>          Issue Type: Bug
>   Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch,
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures
> due to bad output from the upstream task can continue to blame itself for the
> failure before the AM can do a re-run of the upstream offending task and fix
> the fetch failure. This causes the DAG to fail even if a single node fails.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)