Hi,

when discussing about [1] we found out, that the issue is actually caused by processing time watermarks in KinesisIO. Enabling this watermark outputs watermarks based on current processing time, _but event timestamps are derived from ingestion timestamp_. This can cause unbounded lateness when processing backlog. I think this setup is error-prone and will likely cause data loss due to dropped elements. This can be solved in two ways:

 a) deprecate processing time watermarks, or

 b) modify KinesisIO's watermark policy so that is assigns event timestamps as well (the processing-time watermark policy would have to derive event timestamps from processing-time).

I'd prefer option b) , but it might be a breaking change, moreover I'm not sure if I understand the purpose of processing-time watermark policy, it might be essentially ill defined from the beginning, thus it might really be better to remove it completely. There is also a related issue [2].

Any thoughts on this?

 Jan

[1] https://github.com/apache/beam/issues/25975

[2] https://github.com/apache/beam/issues/28760

Reply via email to