[ 
https://issues.apache.org/jira/browse/KAFKA-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-10369:
------------------------------------
    Labels: kip  (was: )

> Introduce Distinct operation in KStream
> ---------------------------------------
>
>                 Key: KAFKA-10369
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10369
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Ivan Ponomarev
>            Assignee: Ivan Ponomarev
>            Priority: Major
>              Labels: kip
>
> Message deduplication is a common task.
> One example: we might have multiple data sources each reporting its state 
> periodically with a relatively high frequency, their current states should be 
> stored in a database. In case the actual change of the state occurs with a 
> lower frequency than it is reported, in order to reduce the number of writes 
> to the database we might want to filter out duplicated messages using Kafka 
> Streams.
> 'Distinct' operation is common in data processing, e. g.
>  * Java Stream has [distinct() 
> |https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--]
>  operation,
>  * SQL has DISTINCT keyword.
>  
> Hence it is natural to expect the similar functionality from Kafka Streams.
> Although Kafka Streams Tutorials contains an 
> [example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html]
>  of how distinct can be emulated , but this example is complicated: it 
> involves low-level coding with local state store and a custom transformer. It 
> might be much more convenient to have distinct as a first-class DSL operation.
> Due to 'infinite' nature of KStream, distinct operation should be windowed, 
> similar to windowed joins and aggregations for KStreams.
> See 
> [KIP-655|https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to