reuvenlax commented on PR #36838:
URL: https://github.com/apache/beam/pull/36838#issuecomment-3558816461

   I'm not convinced that checking against the watermark is correct for
   several reasons. However I agree that the current check is a bit
   restrictive in cases like this.
   
   On Thu, Nov 20, 2025 at 7:42 AM Kenn Knowles ***@***.***>
   wrote:
   
   > *kennknowles* left a comment (apache/beam#36838)
   > <https://github.com/apache/beam/pull/36838#issuecomment-3558736022>
   >
   > Thanks @kennknowles <https://github.com/kennknowles>, @clairemcginty
   > <https://github.com/clairemcginty> is planning to do some testing today.
   >
   > Ran our tests
   > <https://github.com/spotify/scio/actions/runs/19513763696/job/55860003717>
   > with a locally built snapshot of this branch, still seeing the following
   > error:
   >
   > [info]   Cause: java.lang.IllegalArgumentException: Cannot output with 
timestamp 1970-01-01T00:00:00.001Z. Output timestamps must be no earlier than 
the timestamp of the current input or timer (1970-01-01T00:00:00.010Z) minus 
the allowed skew (0 milliseconds) and no later than 294247-01-10T04:00:54.775Z. 
See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the 
allowed skew.
   > [info]   at 
org.apache.beam.runners.core.SimpleDoFnRunner.checkTimestamp(SimpleDoFnRunner.java:263)
   > [info]   at 
org.apache.beam.runners.core.SimpleDoFnRunner.access$1300(SimpleDoFnRunner.java:89)
   > [info]   at 
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.lambda$outputWindowedValue$0(SimpleDoFnRunner.java:464)
   > [info]   at 
org.apache.beam.sdk.values.WindowedValues$Builder.output(WindowedValues.java:222)
   > [info]   at 
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.outputWindowedValue(SimpleDoFnRunner.java:467)
   > [info]   at 
org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:123)
   > [info]   at 
org.apache.beam.sdk.values.WindowedValues$Builder.output(WindowedValues.java:222)
   > [info]   at 
org.apache.beam.sdk.transforms.DoFn$OutputReceiver.outputWindowedValue(DoFn.java:416)
   > [info]   at 
com.spotify.scio.transforms.FileDownloadDoFn.lambda$processElement$0(FileDownloadDoFn.java:105)
   > [info]   at 
com.spotify.scio.transforms.FileDownloadDoFn.flush(FileDownloadDoFn.java:141)
   > [info]   ...
   >
   > is it possible the precision of the timestamp checks changed, or something
   > like that? 1970-01-01T00:00:00.001Z and 1970-01-01T00:00:00.010Z are only
   > off by a millisecond
   >
   > I don't think so. There's https://s.apache.org/beam-timestamp-strategy
   > but I don't think it is related to this.
   >
   > Please note that this failure is in a different place than the original
   > report so my commentary may not apply to both. I looked at the Scio code at
   > 
https://github.com/spotify/scio/blob/041ae06b881dc142ec7e5e6392af7bf627a5d75f/scio-core/src/main/java/com/spotify/scio/transforms/FileDownloadDoFn.java#L105
   > and I do see that the DoFn is pulling elements from batch and outputting
   > them in the context of a different element. So this will definitely look to
   > the runtime like outputting an element with a timestamp that is older than
   > the current input element. I think the fact that it worked before may have
   > just been luck. Though I can see how unlikely it is for it to have worked
   > for as long as it has on that basis. Nonetheless the fact that the
   > FileDownloadDoFn would hit this error is pretty plain.
   >
   > In discussion with @scwhittle <https://github.com/scwhittle> he points
   > out that this would be perfectly legal if done from @FinishBundle. So it
   > is a bit restrictive to disallow it in a process element. The use case of
   > accumulating a batch in a local variable and flushing periodically as in
   > FileDownloadDoFn is clearly something we want to allow. To do so we have
   > a few options:
   >
   >    - Eliminate the timestamp skew check. This might be fine, especially
   >    if it is done intentionally. The check is just to protect DoFn authors 
from
   >    error.
   >    - Track all timestamps in the bundle and allow skew back to the
   >    earliest in the bundle. This seems pretty ad hoc. It allows this use 
case
   >    but without sound reasoning.
   >    - Check timestamps against the watermark rather than the current
   >    element. I think this is principled but I'm not sure if we have plumbed 
the
   >    needed watermark to the right place for this to be easy.
   >
   > I suggest this:
   >
   >    - Immediately, you can just set allowed skew to a large number and it
   >    will effectively disable the check.
   >    - Short term, we can add a method to just disable the check entirely
   >    for a DoFn that has this computation pattern.
   >    - Medium term, we can revise it to check against the watermark, after
   >    thinking a bit more to ensure this is going to be ok.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/beam/pull/36838#issuecomment-3558736022>, or
   > unsubscribe
   > 
<https://github.com/notifications/unsubscribe-auth/AFAYJVP2PIHWBTR6TH64DDT35XOPZAVCNFSM6AAAAACMLTMENKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTKNJYG4ZTMMBSGI>
   > .
   > You are receiving this because you commented.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to