[ https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987033#comment-17987033 ]
Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------

Another, earlier stack trace associated with this error. In this case the upstream host, READTIMEDOUT_HOST, was giving "Read timed out" errors. The task blamed itself instead of correctly blaming the upstream.

{code:java}
2017-07-11 13:15:11,575 [INFO] [Fetcher_O {scope_334} #2] |HttpConnection.url|: for url=http://READTIMEDOUT_HOST:8043/mapOutput?job=******** sent hash and receievd reply 0 ms
2017-07-11 13:29:11,661 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]. len=6251502, decomp=28568387. ExceptionMessage=Read timed out
2017-07-11 13:29:11,661 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from READTIMEDOUT_HOST:8043 failed, retry it.
2017-07-11 13:32:11,763 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:292)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:32:11,766 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=0, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:32:11,767 [WARN] [Fetcher_O {scope_334} #2] |orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks [InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1]]
2017-07-11 13:32:11,767 [INFO] [Fetcher_O {scope_334} #2] |orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier [inputIdentifier=2, attemptNumber=0, pathComponent=attempt_********, spillType=0, spillId=-1], numUniqueHosts=3, hostFailureThreshold=3, hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
2017-07-11 13:35:14,467 [WARN] [Fetcher_O {scope_334} #0] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to READTIMEDOUT_HOST:8043 with 1 inputs pending
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:351)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2017-07-11 13:35:14,467 [ERROR] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.Shuffle|: scope_334: Setting throwable in reportException with message [scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true] from thread [Fetcher_O {scope_334} #0
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: copy(3 (spillsFetched=3) of 4. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2017-07-11 13:35:14,468 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: Shutting down fetchers for input: scope_334, shutdown timetaken: 0 ms, hasFetcherExecutorStopped: true
2017-07-11 13:35:14,468 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: scope_334: Interrupted while waiting for host and hasBeenShutdown. Breaking out of ShuffleSchedulerCallable loop
2017-07-11 13:35:14,469 [INFO] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.ShuffleScheduler|: Shutting down FetchScheduler for input: scope_334, wasInterrupted=true
2017-07-11 13:35:14,469 [INFO] [Fetcher_O {scope_334} #0] |orderedgrouped.ShuffleScheduler|: scope_334: Already shutdown. Ignoring fetch complete
2017-07-11 13:35:14,469 [ERROR] [ShuffleAndMergeRunner {scope_334}] |orderedgrouped.Shuffle|: scope_334: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {scope_334} #0
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:304)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:286)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: scope_334: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1021)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:762)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:379)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
    ... 5 more
{code}

> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
> Key: TEZ-3910
> URL: https://issues.apache.org/jira/browse/TEZ-3910
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch,
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures
> due to bad output from the upstream task can continue to blame itself for the
> failure before the AM can re-run the offending upstream task and fix
> the fetch failure. This causes the DAG to fail even if only a single node fails.
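
To make the logged decision easier to follow, here is a minimal sketch of how a health check of this shape could turn a single bad upstream host into a fatal shuffle error. It is an illustration only, not the Tez implementation: the class name, method name, the threshold parameter `maxFailedUniqueFetches`, and the rule itself are assumptions. Only the flag names (failureCounts, pendingInputs, fetcherHealthy, reducerProgressedEnough, reducerStalled) and their values come from the log above.

```java
/**
 * Hypothetical sketch of a shuffle health check. NOT the Tez source;
 * the rule below is an assumed shape that is consistent with the
 * logged flags, nothing more.
 */
final class ShuffleHealthSketch {

    /**
     * Returns true when the (assumed) rule would fail the shuffle, i.e.
     * the downstream task would blame itself.
     *
     * @param failedUniqueInputs     inputs with at least one fetch failure (failureCounts in the log)
     * @param pendingInputs          inputs not yet fetched (pendingInputs in the log)
     * @param maxFailedUniqueFetches hypothetical threshold, not taken from the log
     */
    static boolean shouldFailShuffle(int failedUniqueInputs,
                                     int pendingInputs,
                                     int maxFailedUniqueFetches,
                                     boolean fetcherHealthy,
                                     boolean reducerProgressedEnough,
                                     boolean reducerStalled) {
        // Either the threshold is crossed, or every remaining input is failing.
        boolean tooManyFailures = failedUniqueInputs >= maxFailedUniqueFetches
                || failedUniqueInputs == pendingInputs;
        return tooManyFailures
                && !fetcherHealthy
                && (!reducerProgressedEnough || reducerStalled);
    }

    public static void main(String[] args) {
        // Flag values from the log: failureCounts=1, pendingInputs=1,
        // fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true.
        // The threshold 5 is an arbitrary illustrative value.
        System.out.println(shouldFailShuffle(1, 1, 5, false, true, true));
    }
}
```

Under this assumed rule, one failing input is enough once it is the only input still pending, which matches the situation in the log: 3 of 4 inputs were copied and the last one sat on READTIMEDOUT_HOST.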