[ 
https://issues.apache.org/jira/browse/BEAM-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060353#comment-16060353
 ] 

Jingsong Lee commented on BEAM-2478:
------------------------------------

Count(Distinct) is a very interesting function.
It requires the operator to keep every distinct value of the field as state, 
and this state can sometimes be very large.
There are three solutions as far as I know:
1. Count with all details of the distinct field: I think we can use a stateful 
ParDo with ValueState (for the count) and SetState (for the distinct values).
2. Approximation algorithms: cardinality estimation (HyperLogLog), Bloom filter, 
or bitmap. This can greatly reduce the amount of state data, but the result may 
be inexact. Apache Kylin uses this.
3. Hierarchical calculation: 
select a, count(distinct b) from t group by a; -----> select a, count(1) from 
(select a, count(1) from t group by a, b) t2 group by a;
The first operator deduplicates on b within each a (it can also do some local 
aggregation by a, which reduces the shuffle data), and the second operator 
counts by a. This can effectively reduce the state data and ease data skew. 
Apache Impala uses this.
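
To make the first and third solutions concrete, here is a minimal in-memory 
sketch in plain Python. It is not Beam code: the function names and the sample 
rows are hypothetical, and the per-key dictionary only stands in for the 
ValueState/SetState an operator would keep. It also assumes that NULL values of 
b are skipped, as COUNT(DISTINCT b) does in SQL.

```python
from collections import defaultdict

# Hypothetical sample rows of table t as (a, b) pairs; None stands for NULL.
ROWS = [("x", 1), ("x", 1), ("x", 2), ("y", 3), ("y", 3), ("x", None)]


def count_distinct_with_set_state(rows):
    """Solution 1: keep the full set of seen b values for each key a."""
    seen = defaultdict(set)  # plays the role of a per-key SetState
    for a, b in rows:
        if b is not None:  # assumption: NULLs are ignored
            seen[a].add(b)
    # The count is derived from the set size here; an operator could
    # instead maintain it incrementally in a separate ValueState.
    return {a: len(values) for a, values in seen.items()}


def count_distinct_two_stage(rows):
    """Solution 3: deduplicate on (a, b) first, then count rows per a."""
    # Stage 1: the equivalent of GROUP BY a, b (one entry per distinct pair).
    pairs = {(a, b) for a, b in rows if b is not None}
    # Stage 2: the equivalent of GROUP BY a with COUNT(1) over stage 1 output.
    counts = defaultdict(int)
    for a, _ in pairs:
        counts[a] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_distinct_with_set_state(ROWS))
    print(count_distinct_two_stage(ROWS))
```

Both functions give the same counts for the sample rows. In the two-stage 
version the first stage is keyed by (a, b), so the work for a hot value of a is 
spread over many keys, which is where the reduction in skew comes from.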

> Distinct Aggregates
> -------------------
>
>                 Key: BEAM-2478
>                 URL: https://issues.apache.org/jira/browse/BEAM-2478
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-sql
>            Reporter: Jingsong Lee
>            Assignee: Tarush Grover
>
> eg: COUNT(DISTINCT empno)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
