[jira] [Comment Edited] (TEZ-4460) Read timed out in shuffle handler

Shailesh Gupta (Jira) Fri, 02 Dec 2022 02:16:06 -0800


    [ 
https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642415#comment-17642415
 ]


Shailesh Gupta edited comment on TEZ-4460 at 12/2/22 10:15 AM:
---------------------------------------------------------------

Ran a simpler query on the tpcds dataset.

select count(*), inv_date_sk from inventory group by inv_date_sk;

It creates 2 stages and reducer stages are taking printing below errors. There 
is shuffle read timeout which is configured for 3 min and we can see a retry 
attempt for the shuffle data.

Logs:

container_1669954729098_0007_01_000003/syslog_attempt_1669954729098_0007_1_01_000005_0
{quote}2022-12-02 06:50:48,368 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #2|#2] 
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from 
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
 

container_1669954729098_0007_01_000006/syslog_attempt_1669954729098_0007_1_01_000003_0
{quote}2022-12-02 06:50:48,527 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #3|#3] 
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from 
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
attempt_1669954729098_0007_1_00_000094_0_10005
{quote}2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: Retrieved 
pathInfo for AttemptPathIdentifier\{jobId='job_1669954729098_0007', dagId='1', 
user='hadoop', attemptId='attempt_1669954729098_0007_1_00_000094_0_10005'} 
check for corresponding loaded messages to determine whether it was loaded or 
cached
2022-12-02 06:47:48,421 DEBUG auxservices.IndexCache: IndexCache MISS: MapId 
attempt_1669954729098_0007_1_00_000094_0_10005 not found
2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: getMapOutputInfo: 
jobId=job_1669954729098_0007, 
mapId=attempt_1669954729098_0007_1_00_000094_0_10005,dataFile=/mnt2/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out,
 
indexFile=/mnt1/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out.index
2022-12-02 06:50:48,529 DEBUG auxservices.ShuffleHandler: RECV: 
/mapOutput?job=job_1669954729098_0007&dag=1&reduce=24-31&map=attempt_1669954729098_0007_1_00_000094_0_10005
  mapId: [attempt_1669954729098_0007_1_00_000094_0_10005]
  reduceId: range: 24-31
  jobId: [job_1669954729098_0007]
  dagId: [1]
  keepAlive: false
{quote}
 

Attaching nm and tez task logs. 

Note: You wonot find the stacktrace provided in desc here and we have made a 
temp change in the code to print stracktrace. 

 

 


was (Author: guptashailesh92):
Ran a simpler query on the tpcds dataset.

select count(*), inv_date_sk from inventory group by inv_date_sk;

It creates 2 stages and reducer stages are taking printing below errors. There 
is shuffle read timeout which is configured for 3 min and we can see a retry 
attempt for the shuffle data.

Logs:

container_1669954729098_0007_01_000003/syslog_attempt_1669954729098_0007_1_01_000005_0
{quote}2022-12-02 06:50:48,368 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #2] 
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from 
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
 

container_1669954729098_0007_01_000006/syslog_attempt_1669954729098_0007_1_01_000003_0
{quote}2022-12-02 06:50:48,527 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #3] 
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from 
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
attempt_1669954729098_0007_1_00_000094_0_10005
{quote}2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: Retrieved 
pathInfo for AttemptPathIdentifier\{jobId='job_1669954729098_0007', dagId='1', 
user='hadoop', attemptId='attempt_1669954729098_0007_1_00_000094_0_10005'} 
check for corresponding loaded messages to determine whether it was loaded or 
cached
2022-12-02 06:47:48,421 DEBUG auxservices.IndexCache: IndexCache MISS: MapId 
attempt_1669954729098_0007_1_00_000094_0_10005 not found
2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: getMapOutputInfo: 
jobId=job_1669954729098_0007, 
mapId=attempt_1669954729098_0007_1_00_000094_0_10005,dataFile=/mnt2/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out,
 
indexFile=/mnt1/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out.index
2022-12-02 06:50:48,529 DEBUG auxservices.ShuffleHandler: RECV: 
/mapOutput?job=job_1669954729098_0007&dag=1&reduce=24-31&map=attempt_1669954729098_0007_1_00_000094_0_10005
  mapId: [attempt_1669954729098_0007_1_00_000094_0_10005]
  reduceId: range: 24-31
  jobId: [job_1669954729098_0007]
  dagId: [1]
  keepAlive: false
{quote}
 

Attaching nm and tez task logs. 

Node: You wonot find the stacktrace provided in desc here and we have made a 
temp change in the code to print stracktrace. 

 

> Read timed out in shuffle handler
> ---------------------------------
>
>                 Key: TEZ-4460
>                 URL: https://issues.apache.org/jira/browse/TEZ-4460
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.2
>            Reporter: Shailesh Gupta
>            Priority: Major
>         Attachments: nm_logs_6_45-6_53.zip, 
> syslog_attempt_1669954729098_0007_1_01_000003_0.gz, 
> syslog_attempt_1669954729098_0007_1_01_000005_0.gz
>
>
> We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in 
> debugging it.
> We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of 
> M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez 
> 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle 
> handler.
> {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: }}
> {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
> {{java.net.SocketInputStream.socketRead0(Native Method) }}
> {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
> {{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
> {{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
> {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
> {{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
> {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
> {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
> {{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
> {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
> {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
> {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
> {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
> {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
> {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
> {{java.lang.Thread.run(Thread.java:750)]}}
> {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] 
> |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed 
> retry it.}}
>  
> In the node manager logs, i can see and other logs
>  
> {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request 
> verfied. 
> enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
>  
> There could be some issue in closing the connections while upgrading netty 
> version.
> The case is reproducible every time, let me know if you need more logs or 
> want me to check something.
>  
>  
> cc: [~abstractdog] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (TEZ-4460) Read timed out in shuffle handler

Reply via email to