Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Say the event-time column chosen is the time of delivery into something like
Kafka.  Due to retries, we can end up with two events for the same id that carry
different timestamps.  Consider the following stream with a watermark threshold
of `5`, where a blank line delineates batch boundaries.
    
    ```
    [id=a, t=0]
    
    [id=b, t=6]
    
    [id=a, t=10]
    ```
    
    The first batch emits `a`; the second batch emits `b` and, because the
watermark has advanced past `t=0 + 5`, drops `a` from the state store.  The
third batch therefore emits a duplicate `a`.
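    To make the mechanism concrete, here is a minimal simulation (plain Python,
not Spark code) of watermark-based deduplication state.  The eviction rule and
watermark-advance timing are simplified assumptions, but they reproduce the
duplicate emission described above:

    ```python
    DELAY = 5  # watermark threshold from the example

    def run(batches):
        seen = {}       # id -> event time: the dedup state store
        watermark = 0   # current event-time watermark
        emitted = []
        for batch in batches:
            for event_id, t in batch:
                # emit only ids not currently in the state store
                if event_id not in seen:
                    emitted.append((event_id, t))
                    seen[event_id] = t
            # advance the watermark from the max event time in this batch
            max_t = max(t for _, t in batch)
            watermark = max(watermark, max_t - DELAY)
            # evict state older than the watermark
            seen = {k: v for k, v in seen.items() if v >= watermark}
        return emitted

    # The stream above: a at t=0, b at t=6, a again at t=10
    print(run([[("a", 0)], [("b", 6)], [("a", 10)]]))
    # -> [('a', 0), ('b', 6), ('a', 10)]  -- 'a' is emitted twice
    ```

    After the second batch the watermark is `6 - 5 = 1`, so the `a -> 0` entry
is evicted, and the third batch sees `a` as new again.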
    
    Since that result seems pretty confusing, I think we may want to require
the user to explicitly include the timestamp (or get an explicit, though
admittedly suboptimal, OOM).

