Jungtaek Lim created SPARK-42931:
------------------------------------

             Summary: dropDuplicates within watermark
                 Key: SPARK-42931
                 URL: https://issues.apache.org/jira/browse/SPARK-42931
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 3.5.0
            Reporter: Jungtaek Lim


We have received many reports that dropDuplicates does not clean up its state 
even though users have set a watermark for the query. The documentation states 
clearly that the event time column must be part of the subset columns for 
deduplication in order for the state to be cleaned up, but customers cannot 
apply this because timestamps are not exactly the same for duplicated events in 
their use cases.
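
To make the problem concrete, here is a minimal plain-Python sketch (not Spark 
code). The helper name dedup_exact and the sample records are hypothetical and 
only model "deduplicate on a subset of columns":

```python
def dedup_exact(events, subset):
    """Keep the first event seen for each distinct value of the subset columns."""
    seen = set()
    kept = []
    for event in events:
        key = tuple(event[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# The same logical event delivered twice, with slightly different timestamps.
events = [
    {"id": "a", "event_time": 100},
    {"id": "a", "event_time": 103},
]

# Event time in the subset: the two rows have different keys, so both survive.
with_event_time = dedup_exact(events, ["id", "event_time"])

# Event time left out: the duplicate is dropped, but in a streaming query the
# set of seen keys could then never be expired by the watermark.
id_only = dedup_exact(events, ["id"])
```

With these inputs, keying on (id, event_time) keeps both rows, while keying on 
id alone keeps one row at the cost of state that is never cleaned up.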

We propose to introduce a new dropDuplicates API which differs from the 
existing dropDuplicates in the following characteristics:
 * Weaker constraints on the subset (key)
 ** Does not require an event time column in the subset.
 * Looser semantics on deduplication
 ** Only guarantees to deduplicate events within the watermark.

Since the new API leverages event time, it has the following new 
requirements:
 * The input must be a streaming DataFrame.
 * The watermark must be defined.
 * The event time column must be defined in the input DataFrame.

More specifically on the semantics: once the operator processes the first 
arrived event, duplicates of it arriving within the watermark for that first 
event will be deduplicated.
(Technically, the expiration time should be the “event time of the first 
arrived event + watermark delay threshold”, to match up with future events.)
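
The semantics above can be sketched as a small plain-Python model. This is a toy 
under stated assumptions, not Spark's implementation: the class name 
WithinWatermarkDeduplicator is hypothetical, event times are plain numbers, the 
watermark is taken as "max event time seen minus the delay threshold", and it 
is advanced after every event rather than per micro-batch.

```python
class WithinWatermarkDeduplicator:
    """Toy model: remember each key until 'first event time + delay'."""

    def __init__(self, delay_threshold):
        self.delay = delay_threshold
        self.max_event_time = None  # largest event time seen so far
        self.expiry = {}            # key -> event time of first arrival + delay

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, key, event_time):
        """Return True if the event is emitted, False if it is dropped."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            emitted = False          # assumption: late events are dropped
        elif key in self.expiry:
            emitted = False          # duplicate within the watermark
        else:
            self.expiry[key] = event_time + self.delay
            emitted = True           # first arrival for this key
        # Advance the watermark, then evict keys whose expiration has passed.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark
        self.expiry = {k: t for k, t in self.expiry.items() if t > wm}
        return emitted


dedup = WithinWatermarkDeduplicator(delay_threshold=10)
arrivals = [("a", 0), ("a", 3), ("b", 25), ("a", 26), ("c", 5)]
results = [dedup.process(key, ts) for key, ts in arrivals]
```

In this run the second "a" is dropped as a duplicate. The third "a" is emitted 
again because the state for "a" expired once the watermark passed 10, which 
illustrates that deduplication is only guaranteed within the watermark. The 
final "c" is dropped as late under the model's assumption.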

Users are encouraged to set the delay threshold of the watermark longer than 
the maximum timestamp difference among duplicated events. (If they are unsure, 
they can alternatively set a sufficiently large delay threshold, e.g. 48 hours.)
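
One way to estimate that maximum difference is to measure it on a sample of 
historical events. The sketch below is plain Python; the function name 
max_duplicate_spread and the sample data are hypothetical, and event times are 
assumed to be numbers in a single unit (seconds here).

```python
def max_duplicate_spread(events):
    """Largest gap between earliest and latest event time sharing a key.

    events: iterable of (key, event_time) pairs.
    """
    earliest, latest = {}, {}
    for key, ts in events:
        earliest[key] = min(ts, earliest.get(key, ts))
        latest[key] = max(ts, latest.get(key, ts))
    return max((latest[k] - earliest[k] for k in earliest), default=0)


sample = [("a", 100), ("a", 103), ("b", 200), ("b", 245), ("c", 300)]
spread = max_duplicate_spread(sample)
```

For this sample the spread is 45 (from key "b"), so the delay threshold should 
be set longer than 45 seconds, with some margin because a sample may not 
contain the worst case.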

A longer design doc will be attached.



