[
https://issues.apache.org/jira/browse/KAFKA-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744663#comment-16744663
]
Guozhang Wang commented on KAFKA-7820:
--------------------------------------
[~vinubarro] Thanks for sharing your use case. I think the proposal 2) from
[~bchen225242] may well fit your needs. To be more specific: say you need 10-20
fields that require distinct counts, you can create a repartition key which is
a combo of all of these fields via a single repartition topic. For example, if
the fields you are interested in are A, B, C, and you create a combo key
(A,B,C), the semantics of a co-partition key is that "all the records with the
same values in A,B,C will go to the same partition", which implies "all the
records with the same values of A will go to the same partition" (same for B,
C). So after you've done the repartitioning, to get the distinct count on
field A, say, you can aggregate on B/C and count on A, aggregate on A/C to
count on B, etc.
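
To make the combo-key idea concrete, here is a minimal sketch in plain Java.
It does not use the Kafka Streams API: the class CompositeKeySketch and the
methods partitionFor, repartition and distinctField are hypothetical names made
up for illustration, and List.hashCode() is only a stand-in for whatever
partitioner the repartition topic really uses. It shows the quoted guarantee
that records with equal (A,B,C) values land on the same partition, plus a
distinct count on one field over a list of records. It makes no claim about how
the values of a single field are spread across partitions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public final class CompositeKeySketch {

    private CompositeKeySketch() {}

    // Assumption: a simple hash-mod partitioner over the combo key (A,B,C).
    // Equal keys have equal hash codes, so they map to the same partition.
    static int partitionFor(List<String> comboKey, int numPartitions) {
        return Math.floorMod(comboKey.hashCode(), numPartitions);
    }

    // Simulates a single repartition step: each record is its own combo key
    // and is routed to the partition chosen by partitionFor.
    static Map<Integer, List<List<String>>> repartition(
            List<List<String>> records, int numPartitions) {
        Map<Integer, List<List<String>>> partitions = new TreeMap<>();
        for (List<String> record : records) {
            int p = partitionFor(record, numPartitions);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(record);
        }
        return partitions;
    }

    // Collects the distinct values of one field (0 = A, 1 = B, 2 = C)
    // among the given records; the set size is the distinct count.
    static Set<String> distinctField(List<List<String>> records, int fieldIndex) {
        Set<String> seen = new HashSet<>();
        for (List<String> record : records) {
            seen.add(record.get(fieldIndex));
        }
        return seen;
    }
}
```

For example, repartition(records, 4) groups the records by simulated
partition, and distinctField(partitions.get(p), 0).size() gives the number of
distinct A values seen in partition p.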
> distinct count kafka streams api
> --------------------------------
>
> Key: KAFKA-7820
> URL: https://issues.apache.org/jira/browse/KAFKA-7820
> Project: Kafka
> Issue Type: New Feature
> Components: streams
> Reporter: Vinoth Rajasekar
> Priority: Minor
> Labels: needs-kip
>
> We are using Kafka Streams for our real-time analytics use cases. Most of our
> use cases involve doing a distinct count on certain fields.
> Currently we do the distinct count by storing the hash map value of the data
> in a set and counting as events flow in. There are a lot of challenges in
> doing this in application memory, because storing the hash map values and
> counting them is limited by the allotted memory size. When we get high volume
> or a spike in traffic, the hash map of the distinct count fields grows beyond
> the allotted memory size, leading to issues.
> The other issue is that when we scale the app, we need to use global KTables
> so that we get all the values for doing the distinct count, and this adds
> back pressure in the cluster, or we have to re-partition the topic and do the
> count on the key.
> Can we have a feature where the distinct count is supported through the
> Streams API at the framework level, rather than dealing with it at the
> application level?