[ https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807526#comment-17807526 ]

Yangze Guo commented on FLINK-34105:
------------------------------------
[~zhuzh] I encountered the same Akka timeout issue in my testing with a
two-level join job using the same concurrency configuration. Increasing
pekko.ask.timeout did resolve the problem.

I believe the root cause is that we moved the serialization and compression
of ShuffleDescriptorGroup from the RPC main thread to Akka's serialization
thread. The time spent on this operation now counts toward the interval
monitored by pekko.ask.timeout. Personally, I consider this an optimization
rather than a problem for users: the serialization thread is pooled, so
serialization and compression can run in parallel. Previously, each
compression was executed sequentially in the main thread. This change speeds
up job deployment, although for very large jobs users may need to adjust the
configuration manually. WDYT?
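
To illustrate the trade-off, here is a minimal, self-contained Java sketch. It
is not Flink code: CompressionSketch, compressSequentially and compressInPool
are hypothetical names, java.util.zip.Deflater stands in for the real
serialization and compression of ShuffleDescriptorGroup, and the timeoutMillis
deadline only loosely mimics how pekko.ask.timeout also covers the time spent
in the serialization threads.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.zip.Deflater;

public class CompressionSketch {

    // Stand-in for "serialize + compress one ShuffleDescriptorGroup" (assumption).
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Old behaviour (simplified): every payload is compressed one after
    // another on the calling thread, i.e. the "main thread".
    static List<byte[]> compressSequentially(List<byte[]> payloads) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] payload : payloads) {
            result.add(compress(payload));
        }
        return result;
    }

    // New behaviour (simplified): payloads are compressed on a thread pool,
    // but the caller waits under a single overall deadline. If the pool is
    // too slow, Future.get throws a TimeoutException, much like an ask timeout.
    static List<byte[]> compressInPool(List<byte[]> payloads, int threads, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] payload : payloads) {
                Callable<byte[]> task = () -> compress(payload);
                futures.add(pool.submit(task));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            List<byte[]> result = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                result.add(future.get(remaining, TimeUnit.NANOSECONDS));
            }
            return result;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Both methods produce the same compressed bytes in the same order. The pooled
variant can finish sooner on multi-core machines, but with a very large number
of payloads the deadline can expire first, which is the situation where the
timeout would need to be raised.
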
> Akka timeout happens in TPC-DS benchmarks
> -----------------------------------------
>
> Key: FLINK-34105
> URL: https://issues.apache.org/jira/browse/FLINK-34105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Zhu Zhu
> Assignee: Yangze Guo
> Priority: Critical
> Attachments: image-2024-01-16-13-59-45-556.png
>
>
> We noticed that Akka timeouts happen in the 10TB TPC-DS benchmarks in 1.19.
> The problem did not happen in 1.18.0.
> After bisecting, we found that the problem was introduced in FLINK-33532.
> !image-2024-01-16-13-59-45-556.png|width=800!