[
https://issues.apache.org/jira/browse/FLINK-39030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39030:
-----------------------------------
Labels: pull-request-available (was: )
> Support emit-only-on-update mode for CUMULATE window TVF to reduce
> unnecessary outputs
> --------------------------------------------------------------------------------------
>
> Key: FLINK-39030
> URL: https://issues.apache.org/jira/browse/FLINK-39030
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / API
> Reporter: Liu
> Priority: Major
> Labels: pull-request-available
>
> h1. Summary
> Currently, the CUMULATE window TVF (Table-Valued Function) triggers output at
> every step interval, regardless of whether new data arrived during that
> period. This behavior leads to unnecessary repeated outputs and increased
> downstream processing overhead in scenarios where data arrives sparsely.
> This proposal introduces an optional emit-only-on-update mode for CUMULATE
> windows that only emits results when new data actually arrives within a step
> interval.
> h1. Motivation
> Consider a CUMULATE window with maxSize=15 seconds and step=5 seconds,
> starting at 00:00:00:
>
> ||Window||Time range||Triggered at||
> |1|[00:00:00, 00:00:05)|00:00:05|
> |2|[00:00:00, 00:00:10)|00:00:10|
> |3|[00:00:00, 00:00:15)|00:00:15|
> *Problem:* If the only record arrives at 00:00:02, the current implementation
> still emits three outputs (at 00:00:05, 00:00:10, and 00:00:15), even though
> windows 2 and 3 contain no new data compared to window 1.
> *Impact:*
>
> * Unnecessary computational overhead
> * Increased network I/O for downstream operators
> * Higher storage costs when persisting results
> * Significant performance degradation in scenarios with sparse data
> distribution
> h1. Proposed Solution
> Add an optional 6th parameter emit_on_update_only (BOOLEAN, default false) to
> the CUMULATE TVF:
> {code:sql}
> CUMULATE(
>   TABLE data_table,
>   DESCRIPTOR(rowtime),
>   INTERVAL '5' SECOND,   -- step
>   INTERVAL '15' SECOND,  -- maxSize
>   INTERVAL '0' SECOND,   -- offset
>   true                   -- emit_on_update_only (NEW)
> )
> {code}
> Behavior when emit_on_update_only = true:
> * Window output is only triggered when new data arrives within the current
> step interval
> * If no new data arrives, the window silently advances without emitting
> results
> * This is particularly useful for sparse data streams where most step
> intervals have no data
> Behavior when emit_on_update_only = false (default):
> * Current behavior is preserved (output at every step interval)
> * Ensures backward compatibility
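
The two modes described above can be illustrated with a small standalone simulation. This is only a sketch of the proposed semantics, not Flink code: the function name cumulate_emissions is hypothetical, timestamps are plain integer seconds, maxSize is assumed to be a multiple of step, and it assumes that a cumulative window with no records at all is never emitted in either mode.

```python
def cumulate_emissions(event_times, step, max_size,
                       emit_on_update_only=False, offset=0):
    """Return (window_start, window_end, record_count) for each emitted window.

    Hypothetical model of the proposal: in update-only mode a cumulative
    window is emitted only if the latest step interval received a record.
    """
    emitted = []
    if not event_times:
        return emitted
    events = sorted(event_times)
    # Align the first and last cumulate cycles to max_size boundaries.
    cycle_start = (events[0] - offset) // max_size * max_size + offset
    last_cycle = (events[-1] - offset) // max_size * max_size + offset
    while cycle_start <= last_cycle:
        # Window ends grow by one step until max_size is reached.
        for end in range(cycle_start + step, cycle_start + max_size + 1, step):
            total = sum(1 for t in events if cycle_start <= t < end)
            fresh = sum(1 for t in events if end - step <= t < end)
            if total == 0:
                continue  # assumption: empty windows produce no output
            if emit_on_update_only and fresh == 0:
                continue  # no new record in this step: skip the emission
            emitted.append((cycle_start, end, total))
        cycle_start += max_size
    return emitted


# Example from the Motivation section: one record at second 2,
# step = 5 s, maxSize = 15 s.
default_mode = cumulate_emissions([2], step=5, max_size=15)
update_only = cumulate_emissions([2], step=5, max_size=15,
                                 emit_on_update_only=True)
print(default_mode)
print(update_only)
```

Under these assumptions, the default mode reports all three windows for the single record, while the update-only mode reports only the first window, [00:00:00, 00:00:05).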