[
https://issues.apache.org/jira/browse/FLINK-35076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
elon_X updated FLINK-35076:
---------------------------
Attachment: (was: image-2024-04-10-20-13-14-752.png)
> Watermark alignment will cause data flow to experience serious shake
> --------------------------------------------------------------------
>
> Key: FLINK-35076
> URL: https://issues.apache.org/jira/browse/FLINK-35076
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.16.1
> Reporter: elon_X
> Priority: Major
> Attachments: image-2024-04-10-20-15-05-731.png,
> image-2024-04-10-20-23-13-872.png
>
>
> In our company, there is a requirement scenario for multi-stream join
> operations, we are making modifications based on Flink watermark alignment,
> then I found that the final join output would experience serious shake.
> and I analyzed the reasons: an upstream topic has more than 300 partitions.
> The number of partitions requested for this topic is too large, causing some
> partitions to frequently experience intermittent writes with QPS=0. This
> phenomenon is more serious between 2 am and 5 am.However, the overall topic
> writing is very smooth.
>
> The final join output will experience serious shake, as shown in the
> following diagram:
> !image-2024-04-10-20-15-05-731.png!
> Root cause:
> # The {{SourceOperator#emitLatestWatermark}} reports the
> lastEmittedWatermark to the SourceCoordinator.
> # If the partition write is zero during a certain period, the
> lastEmittedWatermark sent by the subtask corresponding to that partition
> remains unchanged.
> # The SourceCoordinator aggregates the watermarks of all subtasks according
> to the watermark group and takes the smallest watermark. This means that the
> maxAllowedWatermark may remain unchanged for some time, even though the
> overall upstream data flow is moving forward, until that minimum value is
> updated. Only then will everything change, which will manifest as serious
> shake in the output data stream.
> I think choosing the global minimum might not be a good option. Using min/max
> could more likely encounter some edge cases. Perhaps choosing a median value
> would be more appropriate? Or a more complex selection strategy?
> If replaced with a median value, it can ensure that the overall data flow is
> very smooth:
> !image-2024-04-10-20-23-13-872.png!
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)