[jira] [Commented] (CALCITE-1787) thetaSketch Support for Druid Adapter

Julian Hyde (JIRA) Fri, 09 Jun 2017 11:16:47 -0700

    [ https://issues.apache.org/jira/browse/CALCITE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044793#comment-16044793 ]


Julian Hyde commented on CALCITE-1787:
--------------------------------------

bq. +1 for the idea of abstract metric or what we call in druid complex metric.

OK, let's call this collection "complexMetrics".

bq. Are we saying that for this to work the druid user has to follow this 
naming convention for columns?

If by "druid user" you mean someone writing Druid JSON queries, they would 
ignore the "user" complex metric and write their queries in terms of the 
sketches "user_unique" etc.

The person writing SQL would usually reference "user" in their SQL query. They 
could reference "user_unique" and "user_histogram" directly, but those columns 
would be VARBINARY values, so there's not much they could do with them.

bq. Does this still work if we have multiple sketches for user ? (it is pretty 
common use case where the user is tracked via multiple streams hence multiple 
sketches)

Yes. I'm trying to make it easier to have multiple sketches for "user", and 
also to make it possible to add or remove sketches without rewriting the 
queries.

bq. How calcite will be able to know the details about whether this sketch can 
be used as a histogram or count ?

The mapping

{code:javascript}
  {
    "name" : "user_unique",
    "type" : "hyperUnique",
    "fieldName" : "user"
  }
{code}

provides sufficient information for the planner to rewrite an approximate 
{{count(distinct user)}} to use Druid's hyperUnique aggregator.
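
As a rough illustration of that rewrite (a sketch only, not the adapter's 
actual code): given a list of such mappings, a helper can look up the sketch 
bound to the SQL column and emit a Druid aggregator that reads it. The names 
{{complexMetrics}} (as a variable), {{approxCountDistinct}} and the output name 
are hypothetical, and I'm assuming the query-time aggregator's "fieldName" 
points at the sketch column.

```javascript
// Hypothetical mapping list, in the same shape as the mapping shown above.
const complexMetrics = [
  { name: "user_unique", type: "hyperUnique", fieldName: "user" }
];

// Returns a Druid aggregator spec for an approximate COUNT(DISTINCT column),
// or null if no hyperUnique sketch is mapped to that column.
function approxCountDistinct(metrics, column, outputName) {
  const mapping = metrics.find(
    (m) => m.fieldName === column && m.type === "hyperUnique");
  if (!mapping) {
    return null;
  }
  // Assumption: at query time the aggregator reads the sketch column
  // ("user_unique"), not the logical column ("user").
  return { type: "hyperUnique", name: outputName, fieldName: mapping.name };
}
```

If no mapping matches, the helper returns null, and a planner would leave the 
exact {{count(distinct ...)}} in place.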

bq. Keep in mind that hyperUnique like Theta-sketches or Quantile-Histogram are 
UDFs so we can have different UDFs that does the same thing in the same table 
where each UDF has its own API and capabilities. As an example Theta-Sketches 
(Yahoo sketches) and druid HLL can be used to compute unique user estimate but 
T-Sketch can do intersection/subtract/union while HLL can only do union.

Calcite's planner will only be able to generate calls to UDFs that it is aware 
of. For this release, that will be "hyperUnique" and "thetaSketch". We can add 
more, but that will require a (small) code change to the Druid adapter.
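
To make that concrete, here is a minimal sketch of the idea, assuming the 
planner keeps a fixed set of aggregator types it understands and skips any 
mapping whose type is not in that set. {{KNOWN_SKETCH_TYPES}}, 
{{pickKnownSketch}} and the "user_quantiles" / "approxHistogramFold" entry are 
made up for illustration.

```javascript
// Aggregator types this sketch assumes the planner knows how to generate.
const KNOWN_SKETCH_TYPES = new Set(["hyperUnique", "thetaSketch"]);

// Picks the first mapping for `column` whose type the planner knows, and
// returns a Druid aggregator spec that reads that sketch column.
// Returns null when the column only has sketches of unknown types.
function pickKnownSketch(metrics, column, outputName) {
  const mapping = metrics.find(
    (m) => m.fieldName === column && KNOWN_SKETCH_TYPES.has(m.type));
  if (!mapping) {
    return null;
  }
  return { type: mapping.type, name: outputName, fieldName: mapping.name };
}

// Hypothetical table with several sketches for the same logical column.
const userSketches = [
  { name: "user_quantiles", type: "approxHistogramFold", fieldName: "user" },
  { name: "user_theta", type: "thetaSketch", fieldName: "user" },
  { name: "user_unique", type: "hyperUnique", fieldName: "user" }
];
```

Here {{pickKnownSketch(userSketches, "user", "cnt")}} skips the histogram entry 
and picks the thetaSketch mapping; supporting another type would mean adding it 
to the set (plus whatever translation logic it needs).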

> thetaSketch Support for Druid Adapter
> -------------------------------------
>
>                 Key: CALCITE-1787
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1787
>             Project: Calcite
>          Issue Type: New Feature
>          Components: druid
>    Affects Versions: 1.12.0
>            Reporter: Zain Humayun
>            Assignee: Zain Humayun
>            Priority: Minor
>
> Currently, the Druid adapter does not support the 
> [thetaSketch|http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html]
>  aggregate type, which is used to measure the cardinality of a column 
> quickly. Many Druid instances support theta sketches, so I think it would be 
> a nice feature to have.
> I've been looking at the Druid adapter, and propose we add a new DruidType 
> called {{thetaSketch}} and then add logic in the {{getJsonAggregation}} 
> method in class {{DruidQuery}} to generate the {{thetaSketch}} aggregate. 
> This will require accessing information about the columns (what data type 
> they are) so that the thetaSketch aggregate is only produced if the column's 
> type is {{thetaSketch}}. 
> Also, I've noticed that a {{hyperUnique}} DruidType is currently defined, but 
> a {{hyperUnique}} aggregate is never produced. Since both are approximate 
> aggregators, I could also couple in the logic for {{hyperUnique}}.
> I'd love to hear your thoughts on my approach, and any suggestions you have 
> for this feature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
