[ 
https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4460:
------------------------------
    Description: 
We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in 
debugging it.

We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of 
M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez 
0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle 
handler.

{{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
|orderedgrouped.FetcherOrderedGrouped|: }}
{{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
|orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
{{java.net.SocketInputStream.socketRead0(Native Method) }}
{{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
{{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
{{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
{{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
{{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
{{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
{{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
{{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
{{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
{{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
{{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
{{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
{{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
{{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
{{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
{{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
{{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
{{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
{{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
{{java.lang.Thread.run(Thread.java:750)]}}
{{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed 
retry it.}}

 

In the node manager logs, i can see and other logs

 

{{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request 
verfied. 
enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}

 

There could be some issue in closing the connections while upgrading netty 
version.

The case is reproducible every time, let me know if you need more logs or want 
me to check something.

 

 

cc: [~abstractdog] 

 

> Read timed out in shuffle handler - incorrect usage of EMPTY_LAST_CONTENT and 
> channel write
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4460
>                 URL: https://issues.apache.org/jira/browse/TEZ-4460
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.2
>            Reporter: Shailesh Gupta
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.3
>
>         Attachments: nm_logs_6_45-6_53.zip, 
> syslog_attempt_1669954729098_0007_1_01_000003_0.gz, 
> syslog_attempt_1669954729098_0007_1_01_000005_0.gz
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in 
> debugging it.
> We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of 
> M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez 
> 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle 
> handler.
> {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: }}
> {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
> {{java.net.SocketInputStream.socketRead0(Native Method) }}
> {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
> {{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
> {{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
> {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
> {{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
> {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
> {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
> {{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
> {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
> {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
> {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
> {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
> {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
> {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
> {{java.lang.Thread.run(Thread.java:750)]}}
> {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed 
> retry it.}}
>  
> In the node manager logs, i can see and other logs
>  
> {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request 
> verfied. 
> enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
>  
> There could be some issue in closing the connections while upgrading netty 
> version.
> The case is reproducible every time, let me know if you need more logs or 
> want me to check something.
>  
>  
> cc: [~abstractdog] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to