[
https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642415#comment-17642415
]
Shailesh Gupta edited comment on TEZ-4460 at 12/2/22 10:15 AM:
---------------------------------------------------------------
Ran a simpler query on the tpcds dataset.
select count(*), inv_date_sk from inventory group by inv_date_sk;
It creates 2 stages and reducer stages are taking printing below errors. There
is shuffle read timeout which is configured for 3 min and we can see a retry
attempt for the shuffle data.
Logs:
container_1669954729098_0007_01_000003/syslog_attempt_1669954729098_0007_1_01_000005_0
{quote}2022-12-02 06:50:48,368 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #2|#2]
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
container_1669954729098_0007_01_000006/syslog_attempt_1669954729098_0007_1_01_000003_0
{quote}2022-12-02 06:50:48,527 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #3|#3]
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
attempt_1669954729098_0007_1_00_000094_0_10005
{quote}2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: Retrieved
pathInfo for AttemptPathIdentifier\{jobId='job_1669954729098_0007', dagId='1',
user='hadoop', attemptId='attempt_1669954729098_0007_1_00_000094_0_10005'}
check for corresponding loaded messages to determine whether it was loaded or
cached
2022-12-02 06:47:48,421 DEBUG auxservices.IndexCache: IndexCache MISS: MapId
attempt_1669954729098_0007_1_00_000094_0_10005 not found
2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: getMapOutputInfo:
jobId=job_1669954729098_0007,
mapId=attempt_1669954729098_0007_1_00_000094_0_10005,dataFile=/mnt2/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out,
indexFile=/mnt1/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out.index
2022-12-02 06:50:48,529 DEBUG auxservices.ShuffleHandler: RECV:
/mapOutput?job=job_1669954729098_0007&dag=1&reduce=24-31&map=attempt_1669954729098_0007_1_00_000094_0_10005
mapId: [attempt_1669954729098_0007_1_00_000094_0_10005]
reduceId: range: 24-31
jobId: [job_1669954729098_0007]
dagId: [1]
keepAlive: false
{quote}
Attaching nm and tez task logs.
Note: You wonot find the stacktrace provided in desc here and we have made a
temp change in the code to print stracktrace.
was (Author: guptashailesh92):
Ran a simpler query on the tpcds dataset.
select count(*), inv_date_sk from inventory group by inv_date_sk;
It creates 2 stages and reducer stages are taking printing below errors. There
is shuffle read timeout which is configured for 3 min and we can see a retry
attempt for the shuffle data.
Logs:
container_1669954729098_0007_01_000003/syslog_attempt_1669954729098_0007_1_01_000005_0
{quote}2022-12-02 06:50:48,368 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #2]
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
container_1669954729098_0007_01_000006/syslog_attempt_1669954729098_0007_1_01_000003_0
{quote}2022-12-02 06:50:48,527 [WARN] [Fetcher_O \{Map_1 -> Reducer_2} #3]
|orderedgrouped.FetcherOrderedGrouped|: Shuffle output from
ip-172-31-30-50.ec2.internal:13563 failed, retry it.
{quote}
attempt_1669954729098_0007_1_00_000094_0_10005
{quote}2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: Retrieved
pathInfo for AttemptPathIdentifier\{jobId='job_1669954729098_0007', dagId='1',
user='hadoop', attemptId='attempt_1669954729098_0007_1_00_000094_0_10005'}
check for corresponding loaded messages to determine whether it was loaded or
cached
2022-12-02 06:47:48,421 DEBUG auxservices.IndexCache: IndexCache MISS: MapId
attempt_1669954729098_0007_1_00_000094_0_10005 not found
2022-12-02 06:47:48,421 DEBUG auxservices.ShuffleHandler: getMapOutputInfo:
jobId=job_1669954729098_0007,
mapId=attempt_1669954729098_0007_1_00_000094_0_10005,dataFile=/mnt2/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out,
indexFile=/mnt1/yarn/usercache/hadoop/appcache/application_1669954729098_0007/dag_1/output/attempt_1669954729098_0007_1_00_000094_0_10005/file.out.index
2022-12-02 06:50:48,529 DEBUG auxservices.ShuffleHandler: RECV:
/mapOutput?job=job_1669954729098_0007&dag=1&reduce=24-31&map=attempt_1669954729098_0007_1_00_000094_0_10005
mapId: [attempt_1669954729098_0007_1_00_000094_0_10005]
reduceId: range: 24-31
jobId: [job_1669954729098_0007]
dagId: [1]
keepAlive: false
{quote}
Attaching nm and tez task logs.
Node: You wonot find the stacktrace provided in desc here and we have made a
temp change in the code to print stracktrace.
> Read timed out in shuffle handler
> ---------------------------------
>
> Key: TEZ-4460
> URL: https://issues.apache.org/jira/browse/TEZ-4460
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.10.2
> Reporter: Shailesh Gupta
> Priority: Major
> Attachments: nm_logs_6_45-6_53.zip,
> syslog_attempt_1669954729098_0007_1_01_000003_0.gz,
> syslog_attempt_1669954729098_0007_1_01_000005_0.gz
>
>
> We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in
> debugging it.
> We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of
> M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez
> 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle
> handler.
> {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: }}
> {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
> {{java.net.SocketInputStream.socketRead0(Native Method) }}
> {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
> {{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
> {{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
> {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
> {{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
> {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
> {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
> {{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
> {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
> {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
> {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
> {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
> {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
> {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
> {{java.lang.Thread.run(Thread.java:750)]}}
> {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed
> retry it.}}
>
> In the node manager logs, i can see and other logs
>
> {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request
> verfied.
> enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
>
> There could be some issue in closing the connections while upgrading netty
> version.
> The case is reproducible every time, let me know if you need more logs or
> want me to check something.
>
>
> cc: [~abstractdog]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)