[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052446#comment-16052446 ]
Ethan Wang edited comment on PHOENIX-418 at 6/16/17 10:27 PM:
--------------------------------------------------------------

Reply to [~pctony]: BlinkDB also has a similar approximate aggregate feature, which it exposes through a SQL clause. An example would be:
`select count(id) From table ERROR 0.1 CONFIDENCE 95%`
http://blinkdb.org/#Performance

was (Author: aertoria):
Reply to [~pctony]: BlinkDB also has a similar approximate aggregate feature, which it exposes through a SQL clause. An example would be:
`select count(id) From table ERROR 0.1 CONFIDENCE 95%`

> Support approximate COUNT DISTINCT
> ----------------------------------
>
>                 Key: PHOENIX-418
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-418
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: maghamravikiran
>              Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to
> all distinct values (since this will not scale well when the number of
> distinct values is huge). The Apache Drill folks have had some interesting
> discussions on this
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
> They recommend using [Welford's
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
> I'm open to having a config option that selects exact versus approximate. I
> don't have experience building an approximate implementation, so I'm not
> sure how much state is required to keep on the server and return to the
> client (other than realizing it'd be much less than returning all distinct
> values and their counts).
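
To give a rough sense of how much state an approximate COUNT DISTINCT might need on the server and on the wire, here is a minimal sketch using HyperLogLog. HyperLogLog is an assumption made for illustration: the issue does not name it, and nothing here is Phoenix code. The class name `HyperLogLog`, the parameter `p`, and the `add`/`merge`/`count` methods are all hypothetical. The point is only that each server could keep a fixed-size array of registers (4096 small integers for `p=12`) instead of every distinct value, and the client could merge those arrays by taking the per-register maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Hypothetical fixed-size sketch for estimating a distinct count."""

    def __init__(self, p: int = 12):
        if p < 7:
            raise ValueError("p must be >= 7 for the alpha constant used below")
        self.p = p                     # number of hash bits used as register index
        self.m = 1 << p                # number of registers (4096 when p=12)
        self.registers = [0] * self.m  # the whole state; size does not grow with input

    def add(self, value) -> None:
        # Take 64 bits of a SHA-1 digest; SHA-1 is used only for uniform output.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                      # top p bits select a register
        rest = h & ((1 << width) - 1)         # remaining bits
        rank = width - rest.bit_length() + 1  # 1-based position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # Combining partial sketches (e.g. one per server) is a per-register max.
        if other.m != self.m:
            raise ValueError("sketches must use the same p")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    # Two "servers" each see part of the data, with overlap and repeats.
    server_a, server_b = HyperLogLog(), HyperLogLog()
    for i in range(60_000):
        server_a.add(f"id-{i}")
    for i in range(40_000, 100_000):
        server_b.add(f"id-{i}")
    server_a.merge(server_b)  # the "client" merges the partial sketches
    print("estimated distinct ids:", round(server_a.count()))
```

With `p=12` the typical relative error of HyperLogLog is about 1.04 / sqrt(4096), roughly 1.6%, and the state returned per server stays at 4096 registers regardless of how many distinct values were scanned. Whether that accuracy and size trade-off suits Phoenix is a separate design question.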