[
https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643058#comment-17643058
]
László Bodor edited comment on TEZ-4460 at 12/9/22 1:42 PM:
------------------------------------------------------------
I remember I already investigated this problem but never succeeded, would be
happy to see a fix for that, however, I'm not sure how much time I'll have in
the last couple of weeks to investigate
took a look in NM logs, and it's indeed very strange, looking at a remote
client from R:/172.31.27.215:60738
{code}
2022-12-02 06:47:48,263 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] REGISTERED
2022-12-02 06:47:48,263 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] ACTIVE
2022-12-02 06:47:48,263 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] READ: 493B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 112B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 1B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE:
org.apache.tez.auxservices.FadvisedFileRegion@9d8dea6
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 1B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,265 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE:
org.apache.tez.auxservices.FadvisedFileRegion@2f7247db
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 1B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 50B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE:
org.apache.tez.auxservices.FadvisedFileRegion@1772c47
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] WRITE: 0B
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] FLUSH
2022-12-02 06:47:48,267 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] READ COMPLETE
...
nothing for 3 minutes, which is the timeout
...
2022-12-02 06:50:48,368 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] READ COMPLETE
2022-12-02 06:50:48,368 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] INACTIVE
2022-12-02 06:50:48,369 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] CLOSE
2022-12-02 06:50:48,369 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] UNREGISTERED
{code}
a normal channel activity is like this:
{code}
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] REGISTERED
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] ACTIVE
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] READ: 352B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 112B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 1B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 50B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 50B
...writes...
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE:
org.apache.tez.auxservices.FadvisedFileRegion@156fba1b
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 0B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] FLUSH
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] CLOSE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] READ COMPLETE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] INACTIVE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] UNREGISTERED
{code}
was (Author: abstractdog):
I remember I already investigated this problem but never succeeded, would be
happy to see a fix for that, however, I'm not sure how much time I'll have in
the last couple of weeks to investigate
took a look in NM logs, and it's indeed very strange, looking at a remote
client from R:/172.31.27.215:60738
{code}
2022-12-02 06:47:48,266 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] FLUSH
2022-12-02 06:47:48,267 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] READ COMPLETE
...
nothing for 3 minutes, which is the timeout
...
2022-12-02 06:50:48,368 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 - R:/172.31.27.215:60738] READ COMPLETE
2022-12-02 06:50:48,368 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] INACTIVE
2022-12-02 06:50:48,369 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] CLOSE
2022-12-02 06:50:48,369 DEBUG logging.LoggingHandler: [id: 0x6a895269,
L:/172.31.30.50:13563 ! R:/172.31.27.215:60738] UNREGISTERED
{code}
a normal channel activity is like this:
{code}
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] REGISTERED
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] ACTIVE
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] READ: 352B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 112B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 1B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 50B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 50B
...writes...
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE:
org.apache.tez.auxservices.FadvisedFileRegion@156fba1b
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] WRITE: 0B
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] FLUSH
2022-12-02 06:50:48,371 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 - R:/172.31.27.215:50334] CLOSE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] READ COMPLETE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] INACTIVE
2022-12-02 06:50:48,372 DEBUG logging.LoggingHandler: [id: 0xc25cffac,
L:/172.31.30.50:13563 ! R:/172.31.27.215:50334] UNREGISTERED
{code}
> Read timed out in shuffle handler
> ---------------------------------
>
> Key: TEZ-4460
> URL: https://issues.apache.org/jira/browse/TEZ-4460
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.10.2
> Reporter: Shailesh Gupta
> Priority: Major
> Attachments: nm_logs_6_45-6_53.zip,
> syslog_attempt_1669954729098_0007_1_01_000003_0.gz,
> syslog_attempt_1669954729098_0007_1_01_000005_0.gz
>
>
> We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in
> debugging it.
> We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of
> M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez
> 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle
> handler.
> {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: }}
> {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
> {{java.net.SocketInputStream.socketRead0(Native Method) }}
> {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
> {{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
> {{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
> {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
> {{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
> {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
> {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
> {{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
> {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
> {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
> {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
> {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
> {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
> {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
> {{java.lang.Thread.run(Thread.java:750)]}}
> {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5]
> |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed
> retry it.}}
>
> In the node manager logs, i can see and other logs
>
> {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request
> verfied.
> enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
>
> There could be some issue in closing the connections while upgrading netty
> version.
> The case is reproducible every time, let me know if you need more logs or
> want me to check something.
>
>
> cc: [~abstractdog]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)