[ https://issues.apache.org/jira/browse/KAFKA-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias J. Sax updated KAFKA-10369: ------------------------------------ Labels: kip (was: ) > Introduce Distinct operation in KStream > --------------------------------------- > > Key: KAFKA-10369 > URL: https://issues.apache.org/jira/browse/KAFKA-10369 > Project: Kafka > Issue Type: Improvement > Components: streams > Reporter: Ivan Ponomarev > Assignee: Ivan Ponomarev > Priority: Major > Labels: kip > > Message deduplication is a common task. > One example: we might have multiple data sources each reporting its state > periodically with a relatively high frequency, their current states should be > stored in a database. In case the actual change of the state occurs with a > lower frequency than it is reported, in order to reduce the number of writes > to the database we might want to filter out duplicated messages using Kafka > Streams. > 'Distinct' operation is common in data processing, e. g. > * Java Stream has [distinct() > |https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--] > operation, > * SQL has DISTINCT keyword. > > Hence it is natural to expect the similar functionality from Kafka Streams. > Although Kafka Streams Tutorials contains an > [example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html] > of how distinct can be emulated , but this example is complicated: it > involves low-level coding with local state store and a custom transformer. It > might be much more convenient to have distinct as a first-class DSL operation. > Due to 'infinite' nature of KStream, distinct operation should be windowed, > similar to windowed joins and aggregations for KStreams. > See > [KIP-655|https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API] -- This message was sent by Atlassian Jira (v8.3.4#803005)