[GitHub] [flink] xintongsong commented on a diff in pull request #21890: [FLINK-30860][doc] Add document for hybrid shuffle with adaptive batch scheduler

via GitHub Wed, 08 Feb 2023 23:12:54 -0800


xintongsong commented on code in PR #21890:
URL: https://github.com/apache/flink/pull/21890#discussion_r1101054167



##########
docs/content.zh/docs/ops/batch/batch_shuffle.md:
##########
@@ -114,12 +114,34 @@ Hybrid shuffle provides two spilling strategies:
 
 To use hybrid shuffle mode, you need to configure the 
[execution.batch-shuffle-mode]({{< ref "docs/deployment/config" 
>}}#execution-batch-shuffle-mode) to `ALL_EXCHANGES_HYBRID_FULL` (full spilling 
strategy) or `ALL_EXCHANGES_HYBRID_SELECTIVE` (selective spilling strategy).
 
+#### Supports AdaptiveBatchScheduler and SpeculativeExecution
+
+Hybrid shuffle currently supports `AdaptiveBatchScheduler` by default. If you 
want to use `DefaultScheduler`, please configure the [jobmanager.scheduler]({{< 
ref "docs/deployment/config" >}}#jobmanager-scheduler) to `DefaultScheduler`. 
See [elastic_scaling]({{< ref "docs/deployment/elastic_scaling" 
>}}#adaptive-batch-scheduler) for details.
+
+If you want to enable `SpeculativeExecution` in the same time, see 
[speculative_execution]({{< ref "docs/deployment/speculative_execution" >}}) 
for details.

Review Comment:
   This is irrelevant to hybrid shuffle.



##########
docs/content.zh/docs/ops/batch/batch_shuffle.md:
##########
@@ -114,12 +114,34 @@ Hybrid shuffle provides two spilling strategies:
 
 To use hybrid shuffle mode, you need to configure the 
[execution.batch-shuffle-mode]({{< ref "docs/deployment/config" 
>}}#execution-batch-shuffle-mode) to `ALL_EXCHANGES_HYBRID_FULL` (full spilling 
strategy) or `ALL_EXCHANGES_HYBRID_SELECTIVE` (selective spilling strategy).
 
+#### Supports AdaptiveBatchScheduler and SpeculativeExecution
+
+Hybrid shuffle currently supports `AdaptiveBatchScheduler` by default. If you 
want to use `DefaultScheduler`, please configure the [jobmanager.scheduler]({{< 
ref "docs/deployment/config" >}}#jobmanager-scheduler) to `DefaultScheduler`. 
See [elastic_scaling]({{< ref "docs/deployment/elastic_scaling" 
>}}#adaptive-batch-scheduler) for details.
+
+If you want to enable `SpeculativeExecution` in the same time, see 
[speculative_execution]({{< ref "docs/deployment/speculative_execution" >}}) 
for details.
+
+Hybrid shuffle divides the partition data consumption constraints between 
producer and consumer into the following three cases:
+
+- **ALL_PRODUCERS_FINISHED** : hybrid partition data can be consumed only when 
all producers are finished.
+- **ONLY_FINISHED_PRODUCERS** : hybrid partition data can be consumed when its 
producer is finished.
+- **UNFINISHED_PRODUCERS** : hybrid partition data can be consumed even if its 
producer is un-finished.
+
+If `SpeculativeExecution` is enabled, the default constraint is 
`ONLY_FINISHED_PRODUCERS` to bring some performance optimization compared with 
blocking shuffle. Otherwise, the default constraint is `UNFINISHED_PRODUCERS` 
to perform pipelined-like shuffle. These could be configured via 
[jobmanager.partition.hybrid.partition-data-consume-constraint]({{< ref 
"docs/deployment/config" 
>}}#jobmanager-partition-hybrid-partition-data-consume-constraint).

Review Comment:
   What are the potential impacts of changing this option?



##########
docs/content.zh/docs/ops/batch/batch_shuffle.md:
##########
@@ -114,12 +114,34 @@ Hybrid shuffle provides two spilling strategies:
 
 To use hybrid shuffle mode, you need to configure the 
[execution.batch-shuffle-mode]({{< ref "docs/deployment/config" 
>}}#execution-batch-shuffle-mode) to `ALL_EXCHANGES_HYBRID_FULL` (full spilling 
strategy) or `ALL_EXCHANGES_HYBRID_SELECTIVE` (selective spilling strategy).
 
+#### Supports AdaptiveBatchScheduler and SpeculativeExecution
+
+Hybrid shuffle currently supports `AdaptiveBatchScheduler` by default. If you 
want to use `DefaultScheduler`, please configure the [jobmanager.scheduler]({{< 
ref "docs/deployment/config" >}}#jobmanager-scheduler) to `DefaultScheduler`. 
See [elastic_scaling]({{< ref "docs/deployment/elastic_scaling" 
>}}#adaptive-batch-scheduler) for details.
+
+If you want to enable `SpeculativeExecution` in the same time, see 
[speculative_execution]({{< ref "docs/deployment/speculative_execution" >}}) 
for details.
+
+Hybrid shuffle divides the partition data consumption constraints between 
producer and consumer into the following three cases:
+
+- **ALL_PRODUCERS_FINISHED** : hybrid partition data can be consumed only when 
all producers are finished.
+- **ONLY_FINISHED_PRODUCERS** : hybrid partition data can be consumed when its 
producer is finished.
+- **UNFINISHED_PRODUCERS** : hybrid partition data can be consumed even if its 
producer is un-finished.
+
+If `SpeculativeExecution` is enabled, the default constraint is 
`ONLY_FINISHED_PRODUCERS` to bring some performance optimization compared with 
blocking shuffle. Otherwise, the default constraint is `UNFINISHED_PRODUCERS` 
to perform pipelined-like shuffle. These could be configured via 
[jobmanager.partition.hybrid.partition-data-consume-constraint]({{< ref 
"docs/deployment/config" 
>}}#jobmanager-partition-hybrid-partition-data-consume-constraint).
+
+#### Index Spilling
+
+Hybrid shuffle indexes the shuffle data in memory and disk. Generally 
speaking, all index can be cached in memory to speed up index retrieval. 
However, for large batch jobs, this part of memory may bring OOM risks.
+Therefore, hybrid shuffle supports spilling index data to disk. The following 
configuration options can control this behavior:
+
+- 
**[taskmanager.network.hybrid-shuffle.num-retained-in-memory-regions-max]({{< 
ref "docs/deployment/config" 
>}}#taskmanager-network-hybrid-shuffle-num-retained-in-memory-regions-max)** : 
Controls the max number of hybrid retained regions in memory. Increasing this 
value will allow more index entries to be cached in memory.
+- **[taskmanager.network.hybrid-shuffle.spill-index-segment-size]({{< ref 
"docs/deployment/config" 
>}}#taskmanager-network-hybrid-shuffle-spill-index-segment-size)** : Controls 
the segment size(in bytes) of hybrid spilled file data index.

Review Comment:
   I wonder if we want to advertise these options. They are likely to be removed in 
future releases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

