No, I'm referring to this policy [1], which has unexpected data loss
issues that are hard to avoid in user code. The problem is that
assigning timestamps to elements and computing watermarks are
completely decoupled and unrelated, which I'd say is a bug.
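
To illustrate the decoupling, here is a rough simulation in plain
Python. It is not Beam or KinesisIO code; `Record`, `split_by_lateness`
and the numbers are made up for the example, and "dropped" is
simplified to "behind the watermark by more than the allowed lateness"
(a real runner decides that per window).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    payload: str
    arrival_ts: float  # seconds; when the stream ingested the record


def split_by_lateness(records, processing_time, allowed_lateness):
    """Return (kept, dropped) under a processing-time watermark.

    Simplification: an element counts as droppable once it is behind the
    watermark by more than allowed_lateness.
    """
    watermark = processing_time          # watermark follows the wall clock
    kept, dropped = [], []
    for record in records:
        event_ts = record.arrival_ts     # timestamp follows ingestion time
        lateness = watermark - event_ts  # grows with the age of the backlog
        (dropped if lateness > allowed_lateness else kept).append(record)
    return kept, dropped


# A two-hour-old backlog read at t = 7200 s with 60 s of allowed lateness.
backlog = [Record("a", 0.0), Record("b", 3600.0), Record("c", 7190.0)]
kept, dropped = split_by_lateness(
    backlog, processing_time=7200.0, allowed_lateness=60.0
)
print([r.payload for r in kept], [r.payload for r in dropped])
```

The older the backlog, the larger the gap between the two clocks, so no
fixed allowed lateness chosen in user code is guaranteed to be enough.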
Jan
[1]
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kinesis/KinesisIO.Read.html#withProcessingTimeWatermarkPolicy--
On 10/27/23 16:51, Alexey Romanenko wrote:
Why not just create a custom watermark policy for that? Or do you mean
making it the default policy?
—
Alexey
On 27 Oct 2023, at 10:25, Jan Lukavský <je...@seznam.cz> wrote:
Hi,
when discussing [1] we found out that the issue is actually caused by
the processing-time watermark policy in KinesisIO. Enabling this
policy outputs watermarks based on the current processing time, _but
event timestamps are derived from the ingestion timestamp_. This can
cause unbounded lateness when processing a backlog. I think this setup
is error-prone and will likely cause data loss due to dropped
elements. This can be solved in two ways:
a) deprecate processing time watermarks, or
b) modify KinesisIO's watermark policy so that it assigns event
timestamps as well (the processing-time watermark policy would have
to derive event timestamps from processing time).
I'd prefer option b), but it might be a breaking change. Moreover,
I'm not sure I understand the purpose of the processing-time watermark
policy; it might have been ill-defined from the beginning, in which
case it might really be better to remove it completely. There is also
a related issue [2].
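
A minimal sketch of what option b) would mean, in plain Python rather
than the actual KinesisIO API: the source takes one clock reading per
record and uses it for both the event timestamp and the watermark, so
the two cannot drift apart. `read_with_processing_time` and the fake
clock are hypothetical names used only for illustration.

```python
import itertools


def read_with_processing_time(payloads, clock):
    """Yield (payload, event_ts, watermark), both taken from one clock reading."""
    for payload in payloads:
        now = clock()
        # Same value for both, so lateness (watermark - event_ts) is zero.
        yield payload, now, now


# Deterministic stand-in for the wall clock: 1000, 1005, 1010, ...
ticks = itertools.count(start=1000, step=5)
fake_clock = lambda: next(ticks)

emitted = list(read_with_processing_time(["a", "b", "c"], fake_clock))
for payload, event_ts, watermark in emitted:
    print(payload, event_ts, watermark, watermark - event_ts)
```

The cost is that ingestion time is no longer visible in the element
timestamp, which is why this could be a breaking change for existing
pipelines.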
Any thoughts on this?
Jan
[1] https://github.com/apache/beam/issues/25975
[2] https://github.com/apache/beam/issues/28760