Github user marmbrus commented on the issue:
https://github.com/apache/spark/pull/17268
Say the event-time column chosen is the time of delivery into something like
Kafka. Due to retries, we can end up with two events for the same record that
carry different timestamps. Consider the following stream with a watermark
threshold of `5`, where a blank line delineates batch boundaries.
```
[id=a, t=0]

[id=b, t=6]

[id=a, t=10]
```
The first batch emits `a`. The second batch emits `b` and drops `a` from the
store, since the watermark has advanced to `6 - 5 = 1`, past `a`'s timestamp.
The third batch then emits a duplicate `a`.
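The batch-by-batch behavior above can be sketched with a small simulation (a simplified model of watermark-based state eviction, assumed for illustration; it is not Spark's actual state store implementation):

```python
def run_batches(batches, threshold):
    """Process micro-batches of (id, t) events, deduplicating on id alone.

    After each batch, state entries whose event time falls below the
    watermark (max event time seen so far minus `threshold`) are evicted,
    mirroring watermark-based state cleanup.
    """
    store = {}           # id -> event time at which that id was first seen
    max_event_time = 0
    emitted = []
    for batch in batches:
        for event_id, t in batch:
            if event_id not in store:   # dedup key is the id only
                store[event_id] = t
                emitted.append((event_id, t))
            max_event_time = max(max_event_time, t)
        watermark = max_event_time - threshold
        # Evict state older than the watermark.
        store = {k: v for k, v in store.items() if v >= watermark}
    return emitted

batches = [[("a", 0)], [("b", 6)], [("a", 10)]]
print(run_batches(batches, threshold=5))
# [('a', 0), ('b', 6), ('a', 10)] -- 'a' is emitted twice
```

After the second batch the watermark is `1`, so `a`'s state (event time `0`) is evicted, and the third batch re-emits `a` as if it had never been seen.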
Since that result seems pretty confusing, I think we may want to require
the user to explicitly include the timestamp in the deduplication key (or
accept an explicit, though admittedly suboptimal, OOM from state that can
never be evicted).
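To illustrate why including the timestamp in the key makes the result well-defined, here is the same toy state model keyed on `(id, t)` instead of `id` alone (again an illustrative assumption, not Spark's implementation): the output no longer depends on how events happen to be grouped into batches, so state eviction can never manufacture a surprise duplicate.

```python
def dedup_keyed(batches, threshold):
    """Dedup on the pair (id, event time), with watermark-based eviction."""
    store = set()        # dedup key: (id, event time)
    max_t = 0
    out = []
    for batch in batches:
        for eid, t in batch:
            if (eid, t) not in store:
                store.add((eid, t))
                out.append((eid, t))
            max_t = max(max_t, t)
        watermark = max_t - threshold
        # Evicting old state is now safe: a re-arrival with a newer
        # timestamp is a distinct event by definition.
        store = {(eid, t) for (eid, t) in store if t >= watermark}
    return out

one_batch = [[("a", 0), ("b", 6), ("a", 10)]]
three_batches = [[("a", 0)], [("b", 6)], [("a", 10)]]
print(dedup_keyed(one_batch, 5) == dedup_keyed(three_batches, 5))  # True
```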