[ 
https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-42931:
---------------------------------
    Attachment: [External] Mini design doc_ dropDuplicates within watermark.pdf

> dropDuplicates within watermark
> -------------------------------
>
>                 Key: SPARK-42931
>                 URL: https://issues.apache.org/jira/browse/SPARK-42931
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.5.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>         Attachments: [External] Mini design doc_ dropDuplicates within 
> watermark.pdf
>
>
> We got many reports that dropDuplicates does not clean up the state even 
> though they have set a watermark for the query. We document the behavior 
> clearly that the event time column should be a part of the subset columns for 
> deduplication to clean up the state, but it cannot be applied to the 
> customers as timestamps are not exactly the same for duplicated events in 
> their use cases.
> We propose to deduce a new API of dropDuplicates which has following 
> different characteristics compared to existing dropDuplicates:
>  * Weaker constraints on the subset (key)
>  ** Does not require an event time column on the subset.
>  * Looser semantics on deduplication
>  ** Only guarantee to deduplicate events within the watermark.
> Since the new API leverages event time, the new API has following new 
> requirements:
>  * The input must be streaming DataFrame.
>  * The watermark must be defined.
>  * The event time column must be defined in the input DataFrame.
> More specifically on the semantic, once the operator processes the first 
> arrived event, events arriving within the watermark for the first event will 
> be deduplicated.
> (Technically, the expiration time should be the “event time of the first 
> arrived event + watermark delay threshold”, to match up with future events.)
> Users are encouraged to set the delay threshold of watermark longer than max 
> timestamp differences among duplicated events. (If they are unsure, they can 
> alternatively set the delay threshold large enough, e.g. 48 hours.)
> Longer design doc will be attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to