Nominate Beam Contributor of the Month - January

2022-01-19 Thread Aizhamal Nurmamat kyzy
Hey everybody,

I thought it would be great if we celebrated each other's work even more by
selecting Contributor of the Month by a popular vote. I'd really encourage
everyone to participate (even though it is optional) and let us know who is
doing excellent work on different parts of Beam. The winner will get a
small "Thank you" gift from Google's open source programs office.

It will be a very simple process: each month, I will start a thread asking
you to nominate someone you know by either filling out this form [1] or
just replying to me via email with the name of the person you are
nominating, their contact address and one sentence explaining why you are
nominating them.

Treat this as a January thread and start sending us the names!

Thank you :)

[1]
https://docs.google.com/forms/d/1377tB5RWeON2Fxn3CgHdO_hcXkpMGG6qNNGASSqrz8U/viewform?edit_requested=true=1757680734868150072


Re: [PROPOSAL] Batched DoFns in the Python SDK

2022-01-19 Thread Chad Dombrova
>
> Thanks Chad, I'll take a look at your talk and design to see if there are
> any ideas we can merge.
>

Thanks Brian.  My hope is that even if you don't add the complete
scheduling framework, we'll get all the features and hooks we need to build
our toolset without needing to modify Beam code (which is technical debt
that I'd rather not have).  Then we can offer our tool on PyPI for those
who want it.  I'm happy to have a call with you and the Beam team to
discuss the details once you've had a look.

-chad


Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-19 Thread Kenneth Knowles
On Wed, Jan 19, 2022 at 10:59 AM Brian Hulette  wrote:

> > But I believe the issue is that some tooling will bundle up the
> "compile" dependencies and submit with the job, which will then have a
> conflict with the libraries on the cluster.
>
> Do you know specifically what tooling does this? Also I'm not clear what
> you mean by "cluster", is this referring to remote workers?
>

No, I don't, actually. I was being deliberately vague. I think "cluster"
could be Spark (or Flink) workers, etc. There are some flags to tweak whose
dependencies win. Dataflow workers could also have this issue. I guess
something building an uber jar would be the typical example.
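
For illustration (a rough, untested sketch; artifact names and versions are
just placeholders): a user who needs the cluster's Kafka version can declare
it directly in the application POM - a direct declaration wins over the
version coming transitively from Beam's published POM, and provided scope
keeps it out of a shaded/uber jar:

  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-kafka</artifactId>
      <version>2.36.0</version>
    </dependency>
    <!-- Pin the Kafka client to the version already on the cluster;
         provided scope keeps it out of the bundled jar. -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.8.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>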

To be super clear: I'm pretty much in agreement with Daniel, but I also
think there must be some good reason, which we should account for, that it
isn't already done that way.

To unblock the release, Emily has opened a PR to match the prior POM
exactly, which I also think is smart. We can fix up / eliminate `provided`
deps one or a few at a time going forward.

Kenn


>
> On Tue, Jan 18, 2022 at 6:11 AM Kenneth Knowles  wrote:
>
>>
>>
>> On Fri, Jan 14, 2022 at 9:34 AM Daniel Collins 
>> wrote:
>>
>>> > In particular the Hadoop/Spark and Kafka dependencies must be
>>> **provided** as they were. I am not sure of others but those three matter.
>>>
>>> I think there's a bit of a difference here between what should be the
>>> state in the short term versus the long term.
>>>
>>> In the short term, I agree that we should avoid changes to how these
>>> dependencies are reflected in the POM.
>>>
>>> In the long term, I don't think it makes sense for these to continue to
>>> be "provided" dependencies - if users wish to use a different version of
>>> Hadoop, Spark, or Kafka, they can explicitly override the dependencies with
>>> the version they want when building their JAR, even if there is a version
>>> listed as "compile" in the POM file on Maven Central. The only difference
>>> is that if they don't have a version preference, the one listed in the POM
>>> (that we tested with) will be used, which seems like an unambiguous win to
>>> me.
>>>
>>
>> Agree with the sentiment. But I believe the issue is that some tooling
>> will bundle up the "compile" dependencies and submit with the job, which
>> will then have a conflict with the libraries on the cluster. On the other
>> hand, the user will always want to override the "provided" version to match
>> the cluster, in which case it will just be harmless duplicates on the
>> classpath, no? I guess huge file size, but it isn't the 90s any more. Since
>> Ismaël commented, maybe he can help to clarify. I also knew about this
>> reasoning for Spark & Hadoop but I don't know exactly what is required to
>> make it work right.
>>
>> This could become a bothersome issue long term - Gradle dev community has
>> lots of posts that indicate they don't agree with the existence of
>> "provided" or "optional" dependencies. (I happen to agree with them, but
>> philosophy is not the point). We should have a very clear solution for the
>> cases that require one, and document at least on the wiki.
>>
>> Kenn
>>
>>
>>>
>>> -Daniel
>>>
>>> On Thu, Jan 13, 2022 at 4:19 PM Ismaël Mejía  wrote:
>>>
 Optional dependencies should not be a major issue.

 What matters to validate that we are not breaking users is to compare
 the generated POM files with the previous (pre gradle 7 / 2.35.0)
 version and see that what was provided is still provided.

 In particular the Hadoop/Spark and Kafka dependencies must be
 **provided** as they were. I am not sure of others but those three
 matter.

 Ismaël

 On Wed, Jan 12, 2022 at 10:55 PM Emily Ye  wrote:
 >
 > We've chatted offline and have a tentative plan for what to do with
 these dependencies that are currently marked as compileOnly (instead of
 provided). Please review the list if possible [1].
 >
 > Two projects we aren't sure about:
 >
 > :sdks:java:io:hcatalog
 >
 > library.java.jackson_annotations
 > library.java.jackson_core
 > library.java.jackson_databind
 > library.java.hadoop_common
 > org.apache.hive:hive-exec
 > org.apache.hive.hcatalog:hive-hcatalog-core
 >
 > :sdks:java:io:parquet
 >
 > library.java.hadoop_client
 >
 >
 > Does anyone have experience with either of these IOs? ccing Chamikara
 >
 > Thank you,
 > Emily
 >
 >
 > [1]
 https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
 >
 > On Tue, Jan 11, 2022 at 6:38 PM Emily Ye  wrote:
 >>
 >> As the person volunteering to do fixes for this to unblock Beam
 2.36.0, I created a spreadsheet of the projects with dependencies changed
 from provided to compile only [1]. I pre-filled with what I think things
 should be, but I don't have very much background in java/maven/gradle
 configurations so please 

Flaky test issue report (45)

2022-01-19 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-13693: 
beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming timing out at 9 hours 
(created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13575: Flink 
testParDoRequiresStableInput flaky (created 2021-12-28)
https://issues.apache.org/jira/browse/BEAM-13525: Java VR (Dataflow, V2, 
Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests (created 
2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13522: Spark tests failing 
PerKeyOrderingTest (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13519: Java precommit flaky 
(timing out) (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13500: NPE in Flink Portable 
ValidatesRunner streaming suite (created 2021-12-21)
https://issues.apache.org/jira/browse/BEAM-13453: Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use 
(created 2021-12-13)
https://issues.apache.org/jira/browse/BEAM-13401: 
beam_PostCommit_Java_DataflowV2 
org.apache.beam.sdk.io.gcp.pubsublite.ReadWriteIT flaky (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13367: 
[beam_PostCommit_Python36] [ 
apache_beam.io.gcp.experimental.spannerio_read_it_test] Failure summary 
(created 2021-12-01)
https://issues.apache.org/jira/browse/BEAM-13312: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
 is flaky in Java Spark ValidatesRunner suite  (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13311: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite. (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13234: Flake in 
StreamingWordCountIT.test_streaming_wordcount_it (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13025: pubsublite.ReadWriteIT 
flaky in beam_PostCommit_Java_DataflowV2   (created 2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12858: 
org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler 
is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12793: 
beam_PostRelease_NightlySnapshot failed (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12673: 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey (created 2021-07-28)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: Python precommit flaky: 
Failed to read inputs in the data plane (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-11837: Java build flakes: 
"Memory constraints are impeding performance" (created 2021-02-18)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) 

P1 issues report (66)

2022-01-19 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-13694: 
beam_PostCommit_Java_Hadoop_Versions failing with ClassDefNotFoundError 
(created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13693: 
beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming timing out at 9 hours 
(created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13686: OOM while logging a large 
pipeline even when logging level is higher (created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13672: Window.into() without a 
windowFn not correctly translated to portable representation (created 
2022-01-17)
https://issues.apache.org/jira/browse/BEAM-13668: Java Spanner IO Request 
Count metrics broke backwards compatibility (created 2022-01-15)
https://issues.apache.org/jira/browse/BEAM-13615: Bumping up FnApi 
environment version to 9 in Java, Python SDK (created 2022-01-07)
https://issues.apache.org/jira/browse/BEAM-13611: 
CrossLanguageJdbcIOTest.test_xlang_jdbc_write failing in Python PostCommits 
(created 2022-01-07)
https://issues.apache.org/jira/browse/BEAM-13606: bigtable io doesn't 
handle non-ok row mutations (created 2022-01-07)
https://issues.apache.org/jira/browse/BEAM-13598: Install Java 17 on 
Jenkins VM (created 2022-01-04)
https://issues.apache.org/jira/browse/BEAM-13582: Beam website precommit 
mentions broken links, but passes. (created 2021-12-30)
https://issues.apache.org/jira/browse/BEAM-13579: Cannot run 
python_xlang_kafka_taxi_dataflow validation script on 2.35.0 (created 
2021-12-29)
https://issues.apache.org/jira/browse/BEAM-13522: Spark tests failing 
PerKeyOrderingTest (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13504: Remove 
provided/compileOnly deps not intended for external use (created 2021-12-21)
https://issues.apache.org/jira/browse/BEAM-13487: WriteToBigQuery Dynamic 
table destinations returns wrong tableId (created 2021-12-17)
https://issues.apache.org/jira/browse/BEAM-13430: Upgrade Gradle version to 
7.3 (created 2021-12-09)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13237: 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2 (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13213: OnWindowExpiration does 
not work without other state (created 2021-11-10)
https://issues.apache.org/jira/browse/BEAM-13203: Potential data loss when 
using SnsIO.writeAsync (created 2021-11-08)
https://issues.apache.org/jira/browse/BEAM-13164: Race between member 
variable being accessed due to leaking uninitialized state via 
OutboundObserverFactory (created 2021-11-01)
https://issues.apache.org/jira/browse/BEAM-13132: WriteToBigQuery submits a 
duplicate BQ load job if a 503 error code is returned from googleapi (created 
2021-10-27)
https://issues.apache.org/jira/browse/BEAM-13087: 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible 
(created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13078: Python DirectRunner does 
not emit data at GC time (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13076: Python AfterAny, AfterAll 
do not follow spec (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM-12959: Dataflow error in 
CombinePerKey operation (created 2021-09-26)
https://issues.apache.org/jira/browse/BEAM-12867: Either Create or 
DirectRunner fails to produce all elements to the following transform (created 
2021-09-09)
https://issues.apache.org/jira/browse/BEAM-12843: (Broken Pipe induced) 
Bricked Dataflow Pipeline  (created 2021-09-06)
https://issues.apache.org/jira/browse/BEAM-12807: Java creates an incorrect 
pipeline proto when core-construction-java jar is not in the CLASSPATH (created 
2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12799: "Java IO IT Tests" - 
missing data in grafana (created 2021-08-25)
https://issues.apache.org/jira/browse/BEAM-12792: Beam worker only installs 
--extra_package once (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12621: Update Jenkins 

Re: Jira account to request committer permissions

2022-01-19 Thread Kenneth Knowles
Hi Bulat. Welcome!

I have added you to the "Contributors" role on Jira, so you can assign
tickets to yourself.

Kenn

On Wed, Jan 19, 2022 at 8:20 AM Bulat Safiullin (Akvelon INC) <
v-bsafiul...@microsoft.com> wrote:

> Dear Sir/Madam,
>
> Here is my JIRA account to request committer permissions:
> username: bulat.safiullin
> fullName: Bulat Safiullin
>
>
>


Jira account to request committer permissions

2022-01-19 Thread Bulat Safiullin (Akvelon INC)
Dear Sir/Madam,

Here is my JIRA account to request committer permissions:
username: bulat.safiullin
fullName: Bulat Safiullin




Re: Default output timestamp of processing-time timers

2022-01-19 Thread Kenneth Knowles
On Wed, Jan 19, 2022 at 1:11 AM Jan Lukavský  wrote:

> > One note - some people definitely use timer.withOutputTimestamp as a
> watermark hold.
>
> Definitely.
>
> > In fact, I do not view a "watermark hold" as a fundamental concept. The
> act of "set a timer with the intent that I am allowed to produce output
> with timestamp X" is the fundamental concept, and watermark hold is an
> implementation detail that should really never have been surfaced as an
> end-user concept, or really even as an SDK author concept.
>
> Agree that this need not be exposed explicitly, but given the
> causality-preserving invariant that elements arriving *before* the watermark
> *must not* leave after the watermark, I think that .withOutputTimestamp
> actually defines a watermark hold implicitly. I think there is no other valid
> implementation than to hold the output watermark so that it does not cross
> the output timestamp of any active per-key timer (actually, we could
> distinguish the case where the timer is set for already-late elements, where
> there is no need - or possibility - to hold the watermark).
>
> I'd also be supportive of associating any buffer's output timestamp with the
> timer, rather than with the buffer itself, as that really feels like a better
> description of what is *really* going to happen.
>
Is this just a way to connect the state, the timer callback, and the process
element? I wonder how it looks different or what we could do better with
this information. (I like these sorts of ideas, but I can't think of how it
would be different.)

In the case Reuven described, where the timer callback does nothing, there
seems to be a real risk that data is left behind in the buffer when the
watermark hold is released. So you could, for example, have a timer
callback that must always accept the full contents of the buffer, and where
it is obvious to a user that the buffer is cleared after the callback - like
OnWindowExpiration, but OnBufferEviction.
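
As a rough, untested sketch of that shape (class, state, and timer names are
just placeholders), with the hold carried by the timer and the buffer always
drained in the same callback that releases it:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class BufferEvictionFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @TimerId("evict")
  private final TimerSpec evictSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("evict") Timer evict) {
    buffer.add(c.element().getValue());
    // The timer's output timestamp is the watermark hold: output produced by
    // the callback at that timestamp is not late. (Simplified - a careful
    // implementation would keep this at the minimum buffered timestamp.)
    evict
        .withOutputTimestamp(c.timestamp())
        .offset(Duration.standardMinutes(1))
        .setRelative();
  }

  @OnTimer("evict")
  public void onEvict(OnTimerContext c, @StateId("buffer") BagState<String> buffer) {
    for (String value : buffer.read()) {
      c.output(value); // emitted at the timer's output timestamp
    }
    buffer.clear(); // the buffer is evicted in the same callback that releases the hold
  }
}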

> This was probably discussed, but I cannot see this in this discussion,
> what keeps us from setting output timestamp of processing-time timer to
> something like min(endOfWindow, currentOutputWatermark)? Yes, output
> watermark is not stable, but anything that is derived from _processing
> time_ is not stable by definition. For on-time elements, outputWatermark
> gives an estimation of the current position in event-time, so it makes
> sense to me to use that. Are there any counter examples?
>
This seems OK to me. Certainly the hold should never be based on processing
time.
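
Stated concretely (an illustrative helper only, not Beam API; the names are
mine):

import org.joda.time.Instant;

class ProcessingTimeTimerDefaults {
  // The default being discussed for a processing-time timer's output timestamp:
  //   min(endOfWindow, currentOutputWatermark)
  static Instant defaultOutputTimestamp(Instant endOfWindow, Instant currentOutputWatermark) {
    return currentOutputWatermark.isBefore(endOfWindow) ? currentOutputWatermark : endOfWindow;
  }
}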

Kenn

>  Jan
> On 1/18/22 21:10, Kenneth Knowles wrote:
>
> Yea, it makes sense. This is an issue for the global window, where there
> isn't automatic cleanup of state. I've had a few cases where users would
> like a good way of doing state cleanup in the global window too -
> something where whenever state gets buffered there is always a finite timer
> that will fire. There might be an opportunity here, if we attach the hold
> to that associated timer rather than the state. It sounds similar to what
> you describe, where someone made a timer just to create a watermark hold
> associated with some state - I assume they actually do need to process and
> emit that state in some way related to the timer.
>
> On Tue, Jan 18, 2022 at 9:35 AM Reuven Lax  wrote:
>
>> Correct.
>>
>> IIRC originally we didn't want to add "buffered data timestamps"
>> because it was error prone. Leaking even one record in state holds up the
>> watermark and can cause the entire pipeline to grind to a halt. Associating
>> with a timer guarantees that holds are always cleared eventually.
>>
>> On Tue, Jan 18, 2022 at 9:13 AM Kenneth Knowles  wrote:
>>
>>> This is an interesting case, and a legitimate counterexample to
>>> consider. I'd call it a workaround :-). The semantic thing they would
>>> want/need is "output timestamp" associated with buffered data (also
>>> implemented with watermark hold). I do know systems that designed their
>>> state with this built in.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 18, 2022 at 8:57 AM Reuven Lax  wrote:
>>>
 One note - some people definitely use timer.withOutputTimestamp as a
 watermark hold.

>>>
 This is a scenario in which one outputs (from processElement) a
 timestamp behind the current input element timestamp but knows that it is
 safe because there is already an extant timer with an earlier
 output timestamp (state can be used for this). In this case I've seen
 timers set simply for the hold - the actual onTimer never outputs anything.

 Reuven

 On Tue, Jan 18, 2022 at 6:42 AM Kenneth Knowles 
 wrote:

>
>
> On Tue, Dec 14, 2021 at 2:38 PM Steve Niemitz 
> wrote:
>
>> > I think this wouldn't be very robust to different situations where
>> processing time and event time may not be that close to each other.
>>
>> if you do something like `min(endOfWindow, max(eventInputTimestamp,
>> computedFiringTimestamp))` the worst case is that you set a 

Re: Default output timestamp of processing-time timers

2022-01-19 Thread Jan Lukavský
> One note - some people definitely use timer.withOutputTimestamp as a 
watermark hold.


Definitely.

> In fact, I do not view a "watermark hold" as a fundamental concept. 
The act of "set a timer with the intent that I am allowed to produce 
output with timestamp X" is the fundamental concept, and watermark hold 
is an implementation detail that should really never have been surfaced 
as an end-user concept, or really even as an SDK author concept.


Agree that this need not be exposed explicitly, but given the
causality-preserving invariant that elements arriving *before* the watermark
*must not* leave after the watermark, I think that .withOutputTimestamp
actually defines a watermark hold implicitly. I think there is no other
valid implementation than to hold the output watermark so that it does not
cross the output timestamp of any active per-key timer (actually, we could
distinguish the case where the timer is set for already-late elements, where
there is no need - or possibility - to hold the watermark).


I'd also be supportive of associating any buffer's output timestamp with the
timer, rather than with the buffer itself, as that really feels like a better
description of what is *really* going to happen.


This was probably discussed, but I cannot see this in this discussion, 
what keeps us from setting output timestamp of processing-time timer to 
something like min(endOfWindow, currentOutputWatermark)? Yes, output 
watermark is not stable, but anything that is derived from _processing 
time_ is not stable by definition. For on-time elements, outputWatermark 
gives an estimation of the current position in event-time, so it makes 
sense to me to use that. Are there any counter examples?


 Jan

On 1/18/22 21:10, Kenneth Knowles wrote:
Yea, it makes sense. This is an issue for the global window, where
there isn't automatic cleanup of state. I've had a few cases where
users would like a good way of doing state cleanup in the global
window too - something where whenever state gets buffered there is
always a finite timer that will fire. There might be an opportunity
here, if we attach the hold to that associated timer rather than the
state. It sounds similar to what you describe, where someone made a
timer just to create a watermark hold associated with some state - I
assume they actually do need to process and emit that state in some
way related to the timer.


On Tue, Jan 18, 2022 at 9:35 AM Reuven Lax  wrote:

Correct.

IIRC originally we didn't want to add "buffered data timestamps"
because it was error prone. Leaking even one record in state
holds up the watermark and can cause the entire pipeline to grind
to a halt. Associating with a timer guarantees that holds are
always cleared eventually.

On Tue, Jan 18, 2022 at 9:13 AM Kenneth Knowles 
wrote:

This is an interesting case, and a legitimate counterexample
to consider. I'd call it a workaround :-). The semantic thing
they would want/need is "output timestamp" associated with
buffered data (also implemented with watermark hold). I do
know systems that designed their state with this built in.

Kenn

On Tue, Jan 18, 2022 at 8:57 AM Reuven Lax 
wrote:

One note - some people definitely use
timer.withOutputTimestamp as a watermark hold.


This is a scenario in which one outputs (from
processElement) a timestamp behind the current input
element timestamp but knows that it is safe because there
is already an extant timer with an earlier
output timestamp (state can be used for this). In this
case I've seen timers set simply for the hold - the actual
onTimer never outputs anything.

Reuven

On Tue, Jan 18, 2022 at 6:42 AM Kenneth Knowles
 wrote:



On Tue, Dec 14, 2021 at 2:38 PM Steve Niemitz
 wrote:

> I think this wouldn't be very robust to
different situations where processing time and
event time may not be that close to each other.

if you do something like `min(endOfWindow,
max(eventInputTimestamp,
computedFiringTimestamp))` the worst case is that
you set a watermark hold for somewhere in the
future, right?  For example, if the watermark is
lagging 3 hours, processing time = 4pm, event
input = 1pm, window end = 5pm, the watermark
hold/output time is set to 4pm + T. This would
make the timestamps "newer" than the input, but
shouldn't ever create late data, correct?

Also, imo, the timestamps really already cross
domains now, because the watermark (event time) is
held until the (processing