Ahh, ok, I see.

Yes, it looks like a bug. So, I'd propose to deprecate the old "processing 
time” watermark policy, which we can remove later, and create a new fixed one.

PS: It’s recommended to use "org.apache.beam.sdk.io.aws2.kinesis.KinesisIO” 
instead of deprecated “org.apache.beam.sdk.io.kinesis.KinesisIO” one.

—
Alexey

> On 27 Oct 2023, at 17:42, Jan Lukavský <je...@seznam.cz> wrote:
> 
> No, I'm referring to this [1] policy which has unexpected (and hardly 
> avoidable on the user-code side) data loss issues. The problem is that 
> assigning timestamps to elements and watermarks is completely decoupled and 
> unrelated, which I'd say is a bug.
> 
>  Jan
> 
> [1] 
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kinesis/KinesisIO.Read.html#withProcessingTimeWatermarkPolicy--
> 
> On 10/27/23 16:51, Alexey Romanenko wrote:
>> Why not just to create a custom watermark policy for that? Or you mean to 
>> make it as a default policy?
>> 
>> —
>> Alexey
>> 
>>> On 27 Oct 2023, at 10:25, Jan Lukavský <je...@seznam.cz> 
>>> <mailto:je...@seznam.cz> wrote:
>>> 
>>> 
>>> Hi, 
>>> 
>>> when discussing about [1] we found out, that the issue is actually caused 
>>> by processing time watermarks in KinesisIO. Enabling this watermark outputs 
>>> watermarks based on current processing time, _but event timestamps are 
>>> derived from ingestion timestamp_. This can cause unbounded lateness when 
>>> processing backlog. I think this setup is error-prone and will likely cause 
>>> data loss due to dropped elements. This can be solved in two ways: 
>>> 
>>>  a) deprecate processing time watermarks, or 
>>> 
>>>  b) modify KinesisIO's watermark policy so that is assigns event timestamps 
>>> as well (the processing-time watermark policy would have to derive event 
>>> timestamps from processing-time). 
>>> 
>>> I'd prefer option b) , but it might be a breaking change, moreover I'm not 
>>> sure if I understand the purpose of processing-time watermark policy, it 
>>> might be essentially ill defined from the beginning, thus it might really 
>>> be better to remove it completely. There is also a related issue [2]. 
>>> 
>>> Any thoughts on this? 
>>> 
>>>  Jan 
>>> 
>>> [1] https://github.com/apache/beam/issues/25975 
>>> 
>>> [2] https://github.com/apache/beam/issues/28760 
>>> 
>> 

Reply via email to