[jira] [Commented] (CALCITE-1787) thetaSketch Support for Druid Adapter

Joshua Walters (JIRA) Mon, 05 Jun 2017 14:36:37 -0700

    [ 
https://issues.apache.org/jira/browse/CALCITE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037636#comment-16037636
 ]


Joshua Walters commented on CALCITE-1787:
-----------------------------------------

The sketch column (theta, HLL, etc.) has to exist in Druid as a metric column; 
Druid needs it when building the segment files.

We could make the column abstract; the problem is how to inform the user that 
this column can't be used for certain things. You can't filter on 
{{user_id = 123}}, for example, or get distinct values with 
{{DISTINCT user_id}}, but you can do {{COUNT(DISTINCT user_id)}}. Eventually 
you could also do intersections and other set operations.
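
To make that restriction concrete, here is a rough sketch of the kind of check 
I have in mind. Everything in it is hypothetical: {{SketchColumnRules}}, 
{{Usage}}, and the method names are made up for illustration and are not part 
of Calcite or the Druid adapter. The only thing taken from Druid is the shape 
of the {{thetaSketch}} aggregator JSON ({{type}}, {{name}}, {{fieldName}}).

```java
/** Hypothetical helper for illustration; not part of Calcite or Druid. */
final class SketchColumnRules {
  /** Ways a SQL query might reference a column. */
  enum Usage { FILTER, SELECT_DISTINCT, COUNT_DISTINCT }

  private SketchColumnRules() {}

  /** A pre-computed sketch only holds a summary, so only the aggregate form works. */
  static boolean isAllowed(Usage usage) {
    return usage == Usage.COUNT_DISTINCT;
  }

  /** Returns an error message for the user, or null if the usage is allowed. */
  static String check(String column, Usage usage) {
    if (isAllowed(usage)) {
      return null;
    }
    return "Column '" + column + "' is a sketch metric and cannot be used for "
        + usage + "; use COUNT(DISTINCT " + column + ") instead";
  }

  /** Builds a Druid thetaSketch aggregator for COUNT(DISTINCT column). */
  static String toDruidAggregation(String outputName, String column) {
    return "{\"type\":\"thetaSketch\",\"name\":\"" + outputName
        + "\",\"fieldName\":\"" + column + "\"}";
  }
}
```

With this, {{check("user_id", Usage.FILTER)}} would give the user a message 
instead of a silent wrong answer, while {{COUNT(DISTINCT user_id)}} would map 
to the aggregator built by {{toDruidAggregation}}.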

The problem is that this column is a pre-computed binary metric.

It almost seems like a UDF concept would be best here. The column is a binary 
type, which would correctly keep it out of SQL filters. If we could register 
UDFs for {{COUNT DISTINCT}} or {{SET INTERSECTION}}, we wouldn't have to 
meddle with the syntax of SQL. But then there is the overhead of having to 
build and register the UDFs.
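
A minimal sketch of what such UDFs might look like, under some loud 
assumptions: {{SketchUdfs}} and all of its methods are hypothetical, and exact 
{{Set<String>}} values stand in for the binary theta sketches so that the 
example runs on its own. The {{init}}/{{add}}/{{merge}}/{{result}} names 
mirror the shape that, as far as I know, Calcite expects from a user-defined 
aggregate; a real version would have to be pushed down to Druid rather than 
computed like this.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical UDF holder; exact sets stand in for binary theta sketches. */
final class SketchUdfs {
  private SketchUdfs() {}

  /** Aggregate step 1: start with an empty accumulator. */
  static Set<String> init() {
    return new HashSet<>();
  }

  /** Aggregate step 2: union one row's sketch into the accumulator. */
  static Set<String> add(Set<String> acc, Set<String> rowSketch) {
    acc.addAll(rowSketch);
    return acc;
  }

  /** Aggregate step 3: combine two partial accumulators without mutating either. */
  static Set<String> merge(Set<String> left, Set<String> right) {
    Set<String> merged = new HashSet<>(left);
    merged.addAll(right);
    return merged;
  }

  /** Aggregate step 4: the distinct count (an estimate, for a real sketch). */
  static long result(Set<String> acc) {
    return acc.size();
  }

  /** Scalar function: intersection of two sketches, returned as a new sketch. */
  static Set<String> intersect(Set<String> a, Set<String> b) {
    Set<String> common = new HashSet<>(a);
    common.retainAll(b);
    return common;
  }
}
```

The point of the design is that the column keeps a binary type everywhere in 
SQL, and only these registered functions know how to interpret it.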

> thetaSketch Support for Druid Adapter
> -------------------------------------
>
>                 Key: CALCITE-1787
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1787
>             Project: Calcite
>          Issue Type: New Feature
>          Components: druid
>    Affects Versions: 1.12.0
>            Reporter: Zain Humayun
>            Assignee: Zain Humayun
>            Priority: Minor
>
> Currently, the Druid adapter does not support the 
> [thetaSketch|http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html]
>  aggregate type, which is used to measure the cardinality of a column 
> quickly. Many Druid instances support theta sketches, so I think it would be 
> a nice feature to have.
> I've been looking at the Druid adapter, and propose we add a new DruidType 
> called {{thetaSketch}} and then add logic in the {{getJsonAggregation}} 
> method in class {{DruidQuery}} to generate the {{thetaSketch}} aggregate. 
> This will require accessing information about the columns (what data type 
> they are) so that the thetaSketch aggregate is only produced if the column's 
> type is {{thetaSketch}}. 
> Also, I've noticed that a {{hyperUnique}} DruidType is currently defined, but 
> a {{hyperUnique}} aggregate is never produced. Since both are approximate 
> aggregators, I could also couple in the logic for {{hyperUnique}}.
> I'd love to hear your thoughts on my approach, and any suggestions you have 
> for this feature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
