Sam Whittle created BEAM-11366:
----------------------------------

             Summary: Dataflow and KafkaIO has erratic watermark progress due 
to ReaderCache timeouts
                 Key: BEAM-11366
                 URL: https://issues.apache.org/jira/browse/BEAM-11366
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow
            Reporter: Sam Whittle
            Assignee: Sam Whittle


The ReaderCache has a default timeout of 1 minute. If the source is not polled 
within that time, the UnboundedReader is destroyed and will be reinitilized 
from the checkpoint when next polled. This is too short for runs where the 
source is not polled for a while due to other keys to process. In particular 
this seems to affect Streaming engine pipelines with backlog.

>From an initial look, recovery from the Kafka UnboundedReader checkpoint does 
>not seem to use the previous watermark for initializing the timestamp policy 
>and perhaps the timestamp policy will return unknown watermark if it has not 
>yet observed a record.

So possible fixes would be:
- make cache timeout configurable, or maybe dynamic, and increase it
- ensure that KafkaIO recovery from checkpoints initializes the watermark 
properly so that cache eviction does not matter for watermark smoothness



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to