[ 
https://issues.apache.org/jira/browse/KAFKA-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Ponomarev updated KAFKA-10369:
-----------------------------------
    Description: 
Message deduplication is a common task.

One example: we might have multiple data sources each reporting its state 
periodically with a relatively high frequency, their current states should be 
stored in a database. In case the actual change of the state occurs with a 
lower frequency than it is reported, in order to reduce the number of writes to 
the database we might want to filter out duplicated messages using Kafka 
Streams.

'Distinct' operation is common in data processing, e. g.
 * Java Stream has [distinct() 
|https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--]
 operation,
 * SQL has DISTINCT keyword.

 

Hence it is natural to expect the similar functionality from Kafka Streams.

Although Kafka Streams Tutorials contains an 
[example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html]
 of how distinct can be emulated , but this example is complicated: it involves 
low-level coding with local state store and a custom transformer. It might be 
much more convenient to have distinct as a first-class DSL operation.

Due to 'infinite' nature of KStream, distinct operation should be windowed, 
similar to windowed joins and aggregations for KStreams.

See 
[KIP-655|https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API]

  was:
Message deduplication is a common task.

One example: we might have multiple data sources each reporting its state 
periodically with a relatively high frequency, their current states should be 
stored in a database. In case the actual change of the state occurs with a 
lower frequency than it is reported, in order to reduce the number of writes to 
the database we might want to filter out duplicated messages using Kafka 
Streams.

'Distinct' operation is common in data processing, e. g.
 * Java Stream has [distinct() 
|https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--]
 operation,
 * SQL has DISTINCT keyword.

 

Hence it is natural to expect the similar functionality from Kafka Streams.

Although Kafka Streams Tutorials contains an 
[example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html]
 of how distinct can be emulated , but this example is complicated: it involves 
low-level coding with local state store and a custom transformer. It might be 
much more convenient to have distinct as a first-class DSL operation.

Due to 'infinite' nature of KStream, distinct operation should be windowed, 
similar to windowed joins and aggregations for KStreams.


> Introduce Distinct operation in KStream
> ---------------------------------------
>
>                 Key: KAFKA-10369
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10369
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Ivan Ponomarev
>            Assignee: Ivan Ponomarev
>            Priority: Major
>
> Message deduplication is a common task.
> One example: we might have multiple data sources each reporting its state 
> periodically with a relatively high frequency, their current states should be 
> stored in a database. In case the actual change of the state occurs with a 
> lower frequency than it is reported, in order to reduce the number of writes 
> to the database we might want to filter out duplicated messages using Kafka 
> Streams.
> 'Distinct' operation is common in data processing, e. g.
>  * Java Stream has [distinct() 
> |https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#distinct--]
>  operation,
>  * SQL has DISTINCT keyword.
>  
> Hence it is natural to expect the similar functionality from Kafka Streams.
> Although Kafka Streams Tutorials contains an 
> [example|https://kafka-tutorials.confluent.io/finding-distinct-events/kstreams.html]
>  of how distinct can be emulated , but this example is complicated: it 
> involves low-level coding with local state store and a custom transformer. It 
> might be much more convenient to have distinct as a first-class DSL operation.
> Due to 'infinite' nature of KStream, distinct operation should be windowed, 
> similar to windowed joins and aggregations for KStreams.
> See 
> [KIP-655|https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to