[ 
https://issues.apache.org/jira/browse/BEAM-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733506#comment-16733506
 ] 

Jan Doms commented on BEAM-6333:
--------------------------------

> Yes. 2 seconds is expected to cover things like that. The real fix is for 
> Kafka to provide its server side timestamp.

yea, if the kafka broker would do that, with the right guarantees, that would 
be better indeed.

I prefer not to rely on any timings for correctness (except for timestamps that 
have been persisted to disk and that I use for the eventtime).

> Regd determinism in your case, do those two pipelines read from same Kafka 
> cluster or two different ones? If it is from separate clusters, do you 
> tightly control the timestamps? If it is the same cluster, reprocessing 
> should have the same timestamps.

It should be the same cluster, with exactly the same timestamps, as otherwise 
it will get a different outcome.

> Overall, it looks like yours is a specific enough case and you can have your 
> own implementation that does the right thing. Watermark is a pretty hard for 
> causual users to be concerned. Adding more options may not help much. The 
> current default implantation does the right thing and handles idle partitions 
> well. There is nothing worse than user's having to debug stuck pipelines due 
> to obscure watermark stuckness due to something like an idle partition... 
> often they are not experts on Kafka or how the input is processed. More 
> advanced users like yourselves can utilize the APIs and do the right thing.

> Let us know if you still want to have an option to disable it. More user 
> accessible options implies more perceived complexity.

ok, fair enough. keeping it simple for the common user is a reasonable concern. 
and indeed, power users (or just the paranoid) can always implement their own 
timestamp/watermark policy.

although it is worth noting that for safety reasons (potential clockdrift) 
you've chosen a rather lengthy 2 seconds delay. (well, lengthy of course 
depends on the use-case.) using the heartbeats users can get a tighter bound. 
not sure if that is worth mentioning in the method documentation?

so thanks for the interaction and feel free to close the ticket and the PR. 
kind regards, Jan

> allow disabling/configuring automatic watermark generation for idle kafka 
> partitions
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-6333
>                 URL: https://issues.apache.org/jira/browse/BEAM-6333
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-kafka
>    Affects Versions: 2.9.0
>            Reporter: Jan Doms
>            Assignee: Raghu Angadi
>            Priority: Major
>              Labels: easyfix, pull-request-available
>   Original Estimate: 0.5h
>          Time Spent: 10m
>  Remaining Estimate: 20m
>
> For pipelines that require the emitted watermarks to be absolutely correct 
> the current behavior using 2 seconds timeout could lead to problems.
> Therefor it should be possible to disable this behavior. While changing the 
> code the hardcoded 2 seconds can also be made configurable. (There's already 
> a TODO about that in the code.)
>  
> A (preliminary) PR making this change is available: 
> https://github.com/apache/beam/pull/7382.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to