[jira] [Commented] (CALCITE-1787) thetaSketch Support for Druid Adapter

Julian Hyde (JIRA) Mon, 22 May 2017 19:28:18 -0700

    [ https://issues.apache.org/jira/browse/CALCITE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020575#comment-16020575 ]


Julian Hyde commented on CALCITE-1787:
--------------------------------------

Regarding 1. I'm assuming that you want to be able to write "select 
thetaSketch(customerId) from sales".

If so, I think that's not a great idea, because it's not declarative. You don't 
write "emp MERGE JOIN dept", you write "emp JOIN dept" and let the optimizer 
decide which algorithm to use.

I'd prefer we wrote "select count(distinct customerId) approximate (algorithm 
thetaSketch) from sales" or just "select count(distinct customerId) from sales" 
(with a session preference indicating that thetaSketch or hyperLogLog may be 
used).
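
To make the declarative idea concrete, here is a rough Java sketch of how a planner might pick the algorithm instead of the query author. Everything in it is hypothetical: {{DistinctCountAlgorithm}}, {{DistinctCountPlanner}}, and the parameters are illustrative names, not Calcite classes, and the rule (explicit hint first, then the session preference combined with the Druid column type) is only one plausible policy.

```java
import java.util.Optional;

/** Hypothetical strategies for COUNT(DISTINCT ...); not a Calcite type. */
enum DistinctCountAlgorithm { EXACT, HYPER_LOG_LOG, THETA_SKETCH }

/** Hypothetical helper: the optimizer, not the SQL text, picks the algorithm. */
final class DistinctCountPlanner {
  private DistinctCountPlanner() {}

  static DistinctCountAlgorithm choose(
      Optional<DistinctCountAlgorithm> queryHint, // e.g. "approximate (algorithm thetaSketch)"
      boolean sessionAllowsApproximate,           // assumed session preference
      String druidColumnType) {                   // assumed column metadata
    // An explicit hint in the query wins.
    if (queryHint.isPresent()) {
      return queryHint.get();
    }
    // Without permission to approximate, stay exact.
    if (!sessionAllowsApproximate) {
      return DistinctCountAlgorithm.EXACT;
    }
    // Otherwise use whichever sketch the column was ingested as.
    switch (druidColumnType) {
      case "thetaSketch":
        return DistinctCountAlgorithm.THETA_SKETCH;
      case "hyperUnique":
        return DistinctCountAlgorithm.HYPER_LOG_LOG;
      default:
        return DistinctCountAlgorithm.EXACT;
    }
  }
}
```

With this shape, "select count(distinct customerId) from sales" stays unchanged whether the answer comes from an exact count or a sketch.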

Regarding 2. HISTOGRAM_AGG is still in the code (see SqlStdOperatorTable) but 
it is not currently used. When it was used, we would generate plans like this:

{noformat}
SELECT orderId, productId, min(quantity) OVER w, max(quantity) OVER w
FROM Orders
WINDOW w AS (PARTITION BY productId
  ORDER BY orderTimestamp
  RANGE INTERVAL '1' HOUR PRECEDING)

Project($0, $1, $HistogramMin($2), $HistogramMax($2))
  Window($0, $1, HISTOGRAM_AGG($3) over (partition by $1 order by $2 range interval '1' hour preceding))
    Scan(Orders)
{noformat}

As you can see, we compute one aggregate, a histogram (basically a TreeSet on 
top of a FIFO queue), then we have two extractor functions ($HistogramMin and 
$HistogramMax) to get the min and max from it.
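
For illustration, a minimal Java sketch of that kind of structure follows. It is not the code Calcite used: {{WindowHistogram}} and its methods are made-up names, and it swaps the TreeSet for a TreeMap of counts so that duplicate values in the window are handled. The FIFO queue remembers arrival order so the oldest row can be evicted as the window slides, and the sorted map answers min and max.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

/** Hypothetical sliding-window histogram: a FIFO queue plus a sorted multiset. */
final class WindowHistogram {
  // Values in arrival order, so the oldest can be evicted first.
  private final Deque<Integer> fifo = new ArrayDeque<>();
  // Sorted view of the same values: value -> number of occurrences in the window.
  private final TreeMap<Integer, Integer> sorted = new TreeMap<>();

  /** Adds a row's value as it enters the window. */
  void add(int value) {
    fifo.addLast(value);
    sorted.merge(value, 1, Integer::sum);
  }

  /** Removes the oldest value as it leaves the window. */
  void evictOldest() {
    int value = fifo.removeFirst();
    // Decrement the count; returning null drops the key when it reaches zero.
    sorted.computeIfPresent(value, (k, count) -> count == 1 ? null : count - 1);
  }

  /** Extractor in the spirit of $HistogramMin. */
  int min() {
    return sorted.firstKey();
  }

  /** Extractor in the spirit of $HistogramMax. */
  int max() {
    return sorted.lastKey();
  }
}
```

The point of the design is that one aggregate is maintained per window, and both extractors read from it, rather than maintaining separate min and max state.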

> thetaSketch Support for Druid Adapter
> -------------------------------------
>
>                 Key: CALCITE-1787
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1787
>             Project: Calcite
>          Issue Type: New Feature
>          Components: druid
>    Affects Versions: 1.12.0
>            Reporter: Zain Humayun
>            Assignee: Julian Hyde
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Currently, the Druid adapter does not support the 
> [thetaSketch|http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html]
>  aggregate type, which is used to measure the cardinality of a column 
> quickly. Many Druid instances support theta sketches, so I think it would be 
> a nice feature to have.
> I've been looking at the Druid adapter, and propose we add a new DruidType 
> called {{thetaSketch}} and then add logic in the {{getJsonAggregation}} 
> method in class {{DruidQuery}} to generate the {{thetaSketch}} aggregate. 
> This will require accessing information about the columns (what data type 
> they are) so that the thetaSketch aggregate is only produced if the column's 
> type is {{thetaSketch}}. 
> Also, I've noticed that a {{hyperUnique}} DruidType is currently defined, but 
> a {{hyperUnique}} aggregate is never produced. Since both are approximate 
> aggregators, I could also couple in the logic for {{hyperUnique}}.
> I'd love to hear your thoughts on my approach, and any suggestions you have 
> for this feature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
