Fabian Hueske created FLINK-39719:
-------------------------------------
Summary: TemporalRowTimeJoinOperator produces wrong results for
late probe-side records
Key: FLINK-39719
URL: https://issues.apache.org/jira/browse/FLINK-39719
Project: Flink
Issue Type: Bug
Components: Table SQL / Runtime
Affects Versions: 2.3.0
Reporter: Fabian Hueske
The `TemporalRowTimeJoinOperator` might produce wrong results for late-arriving
probe-side records.
This is expected because a main property of the operator is to evict state on
watermark progress. A late-arriving probe-side record might not find the
correct build-side record to join against. It could either find the correct
record, a newer (incorrect) record, or no record to join with.
This behavior causes problems especially if the probe-side input is a
retraction stream. Retraction messages (DELETE, UPDATE_BEFORE) are prone to be
late because their event-timestamp must match the event-time of the
corresponding INSERT record.
Joining retraction rows against incorrect build-side records means that
downstream operators cannot resolve the retraction message.
Since `TemporalRowTimeJoinOperator` is designed to evict state on watermark
progress, we should drop late probe-side records. Most other event/row-time
based operators such as windowed aggregations, row-time OVER aggregate,
interval join have the same behavior.
Similar to these operators, we should track the number of dropped late records
in a metric.
Dropping late records is of course not a perfect solution, but 1) predictable
(and IMO expected) behavior and 2) not worse than producing wrong results.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)