No, I'm referring to this [1] policy which has unexpected (and hardly avoidable on the user-code side) data loss issues. The problem is that assigning timestamps to elements and watermarks is completely decoupled and unrelated, which I'd say is a bug.

 Jan

[1] https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kinesis/KinesisIO.Read.html#withProcessingTimeWatermarkPolicy--

On 10/27/23 16:51, Alexey Romanenko wrote:
Why not just to create a custom watermark policy for that? Or you mean to make it as a default policy?

—
Alexey

On 27 Oct 2023, at 10:25, Jan Lukavský <je...@seznam.cz> wrote:


Hi,

when discussing about [1] we found out, that the issue is actually caused by processing time watermarks in KinesisIO. Enabling this watermark outputs watermarks based on current processing time, _but event timestamps are derived from ingestion timestamp_. This can cause unbounded lateness when processing backlog. I think this setup is error-prone and will likely cause data loss due to dropped elements. This can be solved in two ways:

 a) deprecate processing time watermarks, or

 b) modify KinesisIO's watermark policy so that is assigns event timestamps as well (the processing-time watermark policy would have to derive event timestamps from processing-time).

I'd prefer option b) , but it might be a breaking change, moreover I'm not sure if I understand the purpose of processing-time watermark policy, it might be essentially ill defined from the beginning, thus it might really be better to remove it completely. There is also a related issue [2].

Any thoughts on this?

 Jan

[1] https://github.com/apache/beam/issues/25975

[2] https://github.com/apache/beam/issues/28760

Reply via email to