HeartSaVioR opened a new pull request, #40561:
URL: https://github.com/apache/spark/pull/40561

   ### What changes were proposed in this pull request?
   
   This PR proposes a new variant of dropDuplicates that differs from the
   existing dropDuplicates in the following ways:
   
   * Weaker constraints on the subset (key)
     * Does not require an event time column in the subset.
   * Looser deduplication semantics
     * Only guarantees deduplication of events within the watermark.
   
   Since the new API leverages event time, it has the following new
   requirements:
   
   * The input must be a streaming DataFrame.
   * The watermark must be defined.
   * The event time column must be defined in the input DataFrame.
   
   More specifically, once the operator processes the first arrived event for
   a given key, later events with the same key that arrive within the
   watermark of that first event are deduplicated.
   (Technically, the expiration time is the "event time of the first arrived
   event + watermark delay threshold", so that it can be compared against
   future events.)
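
The expiration rule above can be sketched as a small stand-alone simulation. This is not Spark's implementation: the class name `WatermarkDeduplicator`, its fields, and the per-event watermark update are assumptions made for illustration (a real streaming query advances the watermark per micro-batch and also filters late events). It only shows how "event time of the first arrived event + delay threshold" decides when state for a key is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkDeduplicator:
    """Hypothetical model of the proposed semantics, not Spark code."""
    delay: int                               # watermark delay threshold (seconds)
    max_event_time: float = float("-inf")    # largest event time seen so far
    expirations: dict = field(default_factory=dict)  # key -> expiration time

    def process(self, key, event_time):
        """Return True if the event is emitted, False if dropped as a duplicate."""
        is_duplicate = key in self.expirations
        if not is_duplicate:
            # Expiration = event time of the first arrived event + delay threshold.
            self.expirations[key] = event_time + self.delay
        # Assumption: the watermark advances after every event.
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay
        # Evict every key whose expiration time the watermark has reached.
        self.expirations = {
            k: exp for k, exp in self.expirations.items() if exp > watermark
        }
        return not is_duplicate


dedup = WatermarkDeduplicator(delay=10)
results = [
    dedup.process("a", 100),  # first "a": emitted, expires at 110
    dedup.process("a", 105),  # within the watermark of the first "a": dropped
    dedup.process("b", 121),  # watermark becomes 111, state for "a" is evicted
    dedup.process("a", 122),  # state is gone, so this "a" is emitted again
]
print(results)
```

The last event illustrates the looser guarantee: a duplicate that arrives after the state has expired is no longer detected.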
   
   Users are encouraged to set the watermark delay threshold longer than the
   maximum timestamp difference among duplicated events. (If they are unsure,
   they can instead set a delay threshold that is comfortably large, e.g. 48
   hours.)
   
   ### Why are the changes needed?
   
   The existing dropDuplicates API does not address a valid use case in
   streaming queries.
   
   There are many cases where the event times are not exactly the same even
   though the events themselves are the same. One example is duplicated events
   produced by a non-idempotent writer, where the event time is issued on the
   producer/broker side. Another example is an unstable event time value, where
   users want to use an alternative timestamp such as ingestion time.
   
   In these cases, users have to exclude the event time column from the
   deduplication subset, but then the operator is unable to evict state, so
   the state grows indefinitely.
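
The difference in state growth can be illustrated with a toy comparison. This is a sketch under stated assumptions, not a measurement of Spark: the function `state_sizes` and the synthetic `id-N` keys are hypothetical, and the watermark is advanced after every event for simplicity.

```python
def state_sizes(events, delay):
    """Compare state kept without eviction vs. with watermark-based expiration.

    events: iterable of (key, event_time) pairs.
    Returns (size_without_eviction, size_with_expiration).
    """
    seen_forever = set()   # deduplication on key only: nothing can be evicted
    expiring = {}          # key -> expiration time (first event time + delay)
    max_event_time = float("-inf")
    for key, event_time in events:
        seen_forever.add(key)
        expiring.setdefault(key, event_time + delay)
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay
        # Drop keys whose expiration time the watermark has reached.
        expiring = {k: exp for k, exp in expiring.items() if exp > watermark}
    return len(seen_forever), len(expiring)


# 1000 distinct keys, one per second of event time.
events = [(f"id-{i}", i) for i in range(1000)]
unbounded, bounded = state_sizes(events, delay=10)
print(unbounded, bounded)
```

In this toy run the key-only state holds every key ever seen, while the expiring state holds only the keys from the most recent window of twice the delay.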
   
   To allow state eviction without requiring the event time column to be part
   of the deduplication subset, we need to loosen the semantics of the API,
   which warrants a new API.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this introduces a new public API, dropDuplicatesWithinWatermark.
   
   ### How was this patch tested?
   
   New test suite.

