[
https://issues.apache.org/jira/browse/FLINK-22153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322196#comment-17322196
]
Till Rohrmann commented on FLINK-22153:
---------------------------------------
I've tested the feature locally and from what I can tell it worked well. Since
I tested the feature locally, I did not scale it to very large sizes. I assume
that this kind of testing has been done while developing this feature.
I made a few observations:
* It is not clear when the sort merge shuffle is used from the logs (not even
on DEBUG)
** Only the warning that not enough buffers are configured
* With the default values, warnings that not enough sort buffers are configured
are printed: "Too few sort buffers, please increase Key:
'taskmanager.network.sort-shuffle.min-buffers' , default: 64 (fallback keys:
[]) to a larger value (more than 512)"
** Warning is logged for every ResultPartition, this is quite noisy
* Good exception message if
{{taskmanager.memory.framework.off-heap.batch-shuffle.size}} is configured too
high and exceeds the {{taskmanager.memory.framework.off-heap.size}}.
I am wondering whether we should set the default
{{taskmanager.network.sort-shuffle.min-buffers}} high enough so that no warning
is logged?
In the documentation, we should also mention from which budget the memory is
taken.
> Manually test the sort-merge blocking shuffle
> ---------------------------------------------
>
> Key: FLINK-22153
> URL: https://issues.apache.org/jira/browse/FLINK-22153
> Project: Flink
> Issue Type: Task
> Components: Runtime / Network
> Affects Versions: 1.13.0
> Reporter: Yingjie Cao
> Assignee: Till Rohrmann
> Priority: Blocker
> Fix For: 1.13.0
>
>
> In 1.12, we introduced sort-merge blocking shuffle to Flink and in 1.13, the
> feature was optimized which improves the usability (fix direct memory OOM
> issue) and performance (introduce IO scheduling and broadcast optimization).
> The sort-merge blocking shuffle can be tested following the bellow process:
> # Write a simple batch job using either sql/table or DataStream API; (Word
> count should be enough)
> # Enable sort-merge blocking shuffle by setting
> taskmanager.network.sort-shuffle.min-parallelism to 1 in the Flink
> configuration file;
> # Submit and run the batch job with different parallelism and data volume;
> # Tune the relevant config options
> (taskmanager.network.blocking-shuffle.compression.enabled,
> taskmanager.network.sort-shuffle.min-buffers,
> taskmanager.memory.framework.off-heap.batch-shuffle.size) and see the
> influence.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)