[
https://issues.apache.org/jira/browse/KAFKA-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ayoub Omari reassigned KAFKA-10369:
-----------------------------------
Assignee: Ayoub Omari (was: Ivan Ponomarev)
> Introduce Distinct operation in KStream
> ---------------------------------------
>
> Key: KAFKA-10369
> URL: https://issues.apache.org/jira/browse/KAFKA-10369
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Ivan Ponomarev
> Assignee: Ayoub Omari
> Priority: Major
> Labels: kip
>
> Message deduplication is a common task.
> One example: we might have multiple data sources, each reporting its state
> periodically at a relatively high frequency, and their current states should
> be stored in a database. If the state actually changes less often than it is
> reported, we might want to filter out duplicate messages using Kafka Streams
> in order to reduce the number of writes to the database.
> A 'distinct' operation is common in data processing, e.g.:
> * Java Stream has [distinct()
> |https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--]
> operation,
> * SQL has DISTINCT keyword.
>
> Hence it is natural to expect similar functionality from Kafka Streams.
> Although Kafka Streams Tutorials contains an
> [example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html]
> of how distinct can be emulated, this example is complicated: it involves
> low-level coding with a local state store and a custom transformer. It
> might be much more convenient to have distinct as a first-class DSL operation.
> Due to the 'infinite' nature of a KStream, the distinct operation should be
> windowed, similar to windowed joins and aggregations for KStreams.
> See
> [KIP-655|https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API]
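
To make the idea of a windowed distinct concrete, here is a minimal sketch in plain Java with no Kafka dependency. It is not the Kafka Streams API and not the design proposed in KIP-655. The names `WindowedDistinctSketch`, `Event`, and `distinctWithinWindow` are hypothetical. The window semantics are also an assumption: an event is dropped if an event with the same key was forwarded less than `windowMs` milliseconds earlier, and events are assumed to arrive in timestamp order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Kafka Streams API and not the KIP-655 design.
class WindowedDistinctSketch {

    // Hypothetical event type: a deduplication key plus an event timestamp.
    static final class Event {
        final String key;
        final long timestampMs;

        Event(String key, long timestampMs) {
            this.key = key;
            this.timestampMs = timestampMs;
        }
    }

    // Forwards an event only if no event with the same key was forwarded
    // less than windowMs milliseconds earlier.
    // Assumption: events are supplied in non-decreasing timestamp order.
    static List<Event> distinctWithinWindow(List<Event> events, long windowMs) {
        // Timestamp of the most recently forwarded event per key.
        Map<String, Long> lastForwardedAt = new HashMap<>();
        List<Event> forwarded = new ArrayList<>();
        for (Event event : events) {
            Long previous = lastForwardedAt.get(event.key);
            if (previous == null || event.timestampMs - previous >= windowMs) {
                lastForwardedAt.put(event.key, event.timestampMs);
                forwarded.add(event);
            }
            // Otherwise the event is a duplicate inside the window and is dropped.
        }
        return forwarded;
    }
}
```

With a 1000 ms window, the input a@0, a@500, b@600, a@1000 yields a@0, b@600, a@1000: the second "a" falls inside the window opened by the first, while the third arrives once that window has elapsed. The map in this sketch grows without bound; a real stream processor would need to expire old entries, which is one reason the operation is tied to a window.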
--
This message was sent by Atlassian Jira
(v8.20.10#820010)