[ 
https://issues.apache.org/jira/browse/FLINK-22153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322196#comment-17322196
 ] 

Till Rohrmann commented on FLINK-22153:
---------------------------------------

I've tested the feature locally and, from what I can tell, it works well. 
Because I only tested locally, I did not scale it to very large sizes. I assume 
that this kind of large-scale testing was done while developing the feature.

I made a few observations:

* It is not clear from the logs (not even at DEBUG level) when the sort-merge 
shuffle is used
** The only hint is the warning that not enough buffers are configured
* With the default values, warnings that not enough sort buffers are configured 
are printed: "Too few sort buffers, please increase Key: 
'taskmanager.network.sort-shuffle.min-buffers' , default: 64 (fallback keys: 
[]) to a larger value (more than 512)"
** The warning is logged for every ResultPartition, which is quite noisy

* Good exception message if 
{{taskmanager.memory.framework.off-heap.batch-shuffle.size}} is configured too 
high and exceeds the {{taskmanager.memory.framework.off-heap.size}}.
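
For illustration, a configuration that satisfies this check could look like the 
sketch below. The option keys are the ones named above; the sizes are assumed 
values picked for the example, not recommendations.

```yaml
# flink-conf.yaml (illustrative values only)
# Framework off-heap size of the TaskManager.
taskmanager.memory.framework.off-heap.size: 128m
# Batch shuffle size; it must not exceed the framework off-heap size above,
# otherwise the configuration is rejected with the exception described here.
taskmanager.memory.framework.off-heap.batch-shuffle.size: 32m
```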

I am wondering whether we should set the default 
{{taskmanager.network.sort-shuffle.min-buffers}} high enough so that no warning 
is logged?

In the documentation, we should also mention from which budget the memory is 
taken.

> Manually test the sort-merge blocking shuffle
> ---------------------------------------------
>
>                 Key: FLINK-22153
>                 URL: https://issues.apache.org/jira/browse/FLINK-22153
>             Project: Flink
>          Issue Type: Task
>          Components: Runtime / Network
>    Affects Versions: 1.13.0
>            Reporter: Yingjie Cao
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.13.0
>
>
> In 1.12, we introduced the sort-merge blocking shuffle to Flink. In 1.13, the 
> feature was optimized to improve usability (fixing a direct memory OOM 
> issue) and performance (introducing IO scheduling and a broadcast optimization).
> The sort-merge blocking shuffle can be tested following the process below:
>  # Write a simple batch job using either the SQL/Table or DataStream API (a 
> word count should be enough);
>  # Enable sort-merge blocking shuffle by setting 
> taskmanager.network.sort-shuffle.min-parallelism to 1 in the Flink 
> configuration file;
>  # Submit and run the batch job with different parallelism and data volume;
>  # Tune the relevant config options 
> (taskmanager.network.blocking-shuffle.compression.enabled, 
> taskmanager.network.sort-shuffle.min-buffers, 
> taskmanager.memory.framework.off-heap.batch-shuffle.size) and see the 
> influence. 
>  
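
The steps in the quoted description could translate into a Flink configuration 
file fragment like the sketch below. The option keys are the ones named in the 
issue; the values for compression, buffers, and batch shuffle size are 
assumptions chosen for illustration and are meant to be varied between runs.

```yaml
# flink-conf.yaml (illustrative values only)
# Step 2: enable the sort-merge blocking shuffle.
taskmanager.network.sort-shuffle.min-parallelism: 1
# Step 4: options to tune and compare between runs.
taskmanager.network.blocking-shuffle.compression.enabled: true
# Chosen to be above the 512 threshold named in the warning message.
taskmanager.network.sort-shuffle.min-buffers: 1024
# Assumed size; it must stay within taskmanager.memory.framework.off-heap.size.
taskmanager.memory.framework.off-heap.batch-shuffle.size: 32m
```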



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
