[ https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742549#comment-16742549 ]
Boyang Chen commented on KAFKA-7820:
------------------------------------

Hey Vinoth, thanks for proposing this! Based on your use case, I'm wondering whether we could repartition the input with all the fields of interest as a compound key, and then aggregate on that key. That should fulfill your requirement.

> distinct count kafka streams api
> --------------------------------
>
>                 Key: KAFKA-7820
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7820
>             Project: Kafka
>          Issue Type: New Feature
>          Components: core
>            Reporter: Vinoth Rajasekar
>            Priority: Minor
>
> We are using Kafka Streams for our real-time analytics use cases; most of them involve computing a distinct count over certain fields.
> Currently we compute the distinct count by storing a hash of each record's fields in a set and counting as events flow in. Doing this in application memory is challenging, because the set is bounded by the allotted memory size: under high volume or traffic spikes, the set of distinct-count fields grows beyond that limit and causes failures.
> The other issue is scaling the app: we either need global KTables so that every instance sees all the values for the distinct count, which adds back pressure in the cluster, or we have to re-partition the topic and count on the key.
> Could we have a feature where distinct count is supported through the Streams API at the framework level, rather than handled at the application level?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
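The repartition-and-aggregate approach suggested in the comment can be sketched without a Kafka dependency. The plain-Java example below (with hypothetical field names `userId` and `page`) simulates the idea: derive a compound key from the fields of interest, group records by that key, and take the number of resulting groups as the distinct count. In the Kafka Streams DSL this would roughly correspond to `selectKey(...)` followed by `groupByKey().count()`, which repartitions the stream so all records sharing a compound key land on the same task.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DistinctCountSketch {
    // A record with the two fields we care about (hypothetical names).
    record Event(String userId, String page) {}

    // Build the compound key from the fields of interest, mimicking
    // the selectKey()/repartition step in a Kafka Streams topology.
    static String compoundKey(Event e) {
        return e.userId() + "|" + e.page();
    }

    // Group events by compound key and count per key; the number of
    // distinct compound keys is the distinct count.
    static long distinctCount(List<Event> events) {
        Map<String, Long> countsPerKey = events.stream()
            .collect(Collectors.groupingBy(DistinctCountSketch::compoundKey,
                                           Collectors.counting()));
        return countsPerKey.size();
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("u1", "home"),
            new Event("u1", "home"),
            new Event("u2", "home"),
            new Event("u1", "cart"));
        System.out.println(distinctCount(events)); // 3 distinct (userId, page) pairs
    }
}
```

Because the repartition step co-locates all records with the same compound key, each Streams task only has to track the keys of its own partitions, avoiding the single unbounded in-memory set described in the issue.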