[
https://issues.apache.org/jira/browse/BEAM-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732788#comment-16732788
]
Jan Doms commented on BEAM-6333:
--------------------------------
Seems like my understanding was a bit limited (or just plain incorrect). Thanks
for the pointer.
Would it be correct to say that the mechanism doesn't account for potential
clockdrift between the broker and the beam node?
Or scenarios where the broker restarts and for whatever reason goes back in
time a little, and then emits a record with a higher timestamp than the
previous record, but lower than what beam currently thinks the watermark should
be?
(I'm thinking of jepsen test like scenarios here ... but maybe I'm getting too
paranoid)
> Which determinism are you specifically referring to? Could you give an
> example?
The computation I'm doing (transaction processing) should be a 'pure' function,
in the sense that it should always give the same results (given the same
inputs). It should in theory be possible to perform the computation in parallel
in 2 separate datacenters and let them generate the same output. Or to
(partially) redo the computation later to re-determine a historical state.
Any incorrect watermark would be a problem. I am most sure that the behavior
would be correct if this mechanism is disabled and I generate heartbeat
messages on each partition instead.
> Btw, your implementation will throw exception since watermark can be queried
> even before any records are read, better to return 'min timestamp' in that
> case.
Thanks for the tip. It's actually still untested, and I guess you noticed ;)
> allow disabling/configuring automatic watermark generation for idle kafka
> partitions
> ------------------------------------------------------------------------------------
>
> Key: BEAM-6333
> URL: https://issues.apache.org/jira/browse/BEAM-6333
> Project: Beam
> Issue Type: Improvement
> Components: io-java-kafka
> Affects Versions: 2.9.0
> Reporter: Jan Doms
> Assignee: Raghu Angadi
> Priority: Major
> Labels: easyfix, pull-request-available
> Original Estimate: 0.5h
> Time Spent: 10m
> Remaining Estimate: 20m
>
> For pipelines that require the emitted watermarks to be absolutely correct
> the current behavior using 2 seconds timeout could lead to problems.
> Therefor it should be possible to disable this behavior. While changing the
> code the hardcoded 2 seconds can also be made configurable. (There's already
> a TODO about that in the code.)
>
> A (preliminary) PR making this change is available:
> https://github.com/apache/beam/pull/7382.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)