[ 
https://issues.apache.org/jira/browse/FLINK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807526#comment-17807526
 ] 

Yangze Guo commented on FLINK-34105:
------------------------------------

[~zhuzh] I encountered the same Akka timeout issue in my testing with a 
two-level join job using the same concurrency configuration. Adjusting 
pekko.ask.timeout indeed resolved this problem.

I believe the root cause is that we moved the serialization and compression 
of ShuffleDescriptorGroup from the RPC main thread to Akka's serialization 
thread. The time spent on this operation now counts toward the interval 
monitored by pekko.ask.timeout. Personally, I consider this an optimization 
rather than a problem for users: the serialization threads are pooled, so 
serialization and compression can run in parallel, whereas previously each 
compression was executed sequentially in the main thread. This change speeds 
up job deployment, although for very large jobs users may need to adjust the 
configuration manually. WDYT?
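
To illustrate the trade-off described above, here is a minimal Python sketch. 
It is not Flink code: the function names, the payloads, and the pool size are 
assumptions made for illustration, and zlib stands in for whatever compression 
Flink applies to a ShuffleDescriptorGroup. It only shows that pooled 
compression produces the same results as sequential compression while letting 
several payloads be processed concurrently.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def serialize_and_compress(payload: bytes) -> bytes:
    """Hypothetical stand-in for serializing and compressing one descriptor group."""
    return zlib.compress(payload, 6)


def compress_sequentially(payloads):
    """Main-thread model: payloads are compressed one after another."""
    return [serialize_and_compress(p) for p in payloads]


def compress_in_pool(payloads, workers=4):
    """Pooled model: worker threads compress payloads concurrently.

    Executor.map returns results in input order, so the output is
    comparable element by element with the sequential version.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(serialize_and_compress, payloads))


# Illustrative payloads only; real descriptor groups look nothing like this.
payloads = [bytes([i % 251]) * 200_000 for i in range(8)]
sequential = compress_sequentially(payloads)
pooled = compress_in_pool(payloads)
```

In this sketch the pooled variant trades main-thread time for time spent in 
the worker pool. In the situation discussed here, that pooled time is what 
falls inside the window monitored by pekko.ask.timeout.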

> Akka timeout happens in TPC-DS benchmarks
> -----------------------------------------
>
>                 Key: FLINK-34105
>                 URL: https://issues.apache.org/jira/browse/FLINK-34105
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Zhu Zhu
>            Assignee: Yangze Guo
>            Priority: Critical
>         Attachments: image-2024-01-16-13-59-45-556.png
>
>
> We noticed that Akka timeouts happen in 10TB TPC-DS benchmarks in 1.19. The 
> problem did not happen in 1.18.0.
> After bisecting, we found that the problem was introduced in FLINK-33532.
>  !image-2024-01-16-13-59-45-556.png|width=800! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
