[jira] [Commented] (CALCITE-1787) thetaSketch Support for Druid Adapter

Zain Humayun (JIRA) Wed, 31 May 2017 17:39:29 -0700

    [ 
https://issues.apache.org/jira/browse/CALCITE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032263#comment-16032263
 ]


Zain Humayun commented on CALCITE-1787:
---------------------------------------

[~bslim] 
At the moment, an exception is thrown when the following group by query is 
issued:

{code:sql}
SELECT "user_unique", count("brand_name")
FROM "foodmart"
GROUP BY "user_unique";
{code}

{code:java}
java.lang.IllegalStateException: Unhandled value type: class java.lang.String
        at org.apache.calcite.avatica.util.AbstractCursor$BinaryAccessor.getString(AbstractCursor.java:813)
        at org.apache.calcite.avatica.AvaticaResultSet.getString(AvaticaResultSet.java:245)
        at sqlline.Rows$Row.<init>(Rows.java:183)
        at sqlline.IncrementalRows.hasNext(IncrementalRows.java:66)
        at sqlline.TableOutputFormat.print(TableOutputFormat.java:33)
        at sqlline.SqlLine.print(SqlLine.java:1663)
        at sqlline.Commands.execute(Commands.java:833)
        at sqlline.Commands.sql(Commands.java:732)
        at sqlline.SqlLine.dispatch(SqlLine.java:813)
        at sqlline.SqlLine.begin(SqlLine.java:686)
        at sqlline.SqlLine.start(SqlLine.java:398)
        at sqlline.SqlLine.main(SqlLine.java:291)
{code}

It appears that columns with SQL type {{SqlTypeName.VARBINARY}} ("user_unique" 
above) cause trouble when they are printed. To me, it doesn't make much sense 
to allow queries that group by columns of type hyperUnique/thetaSketch, since 
these types have a very specialized purpose. I'm thinking we should display an 
error message instead. If that is undesirable, another solution would be to 
change the SQL type to {{SqlTypeName.VARCHAR}} and treat these columns as 
varchars internally. I'd be interested to hear your thoughts on this.
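
To make the first option concrete, here is a minimal sketch of such a check. 
It is not adapter code: the {{GroupByCheck}} class, the {{ColumnType}} enum and 
the {{checkGroupBy}} method are hypothetical names chosen for illustration, and 
the real check would presumably live wherever the adapter translates the 
group-by keys.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in the Druid adapter. */
class GroupByCheck {
  /** Illustrative stand-in for the adapter's column types. */
  enum ColumnType { STRING, LONG, FLOAT, HYPER_UNIQUE, THETA_SKETCH }

  /** Complex metric types that should not be used as group-by keys. */
  private static final Set<ColumnType> COMPLEX_METRICS =
      EnumSet.of(ColumnType.HYPER_UNIQUE, ColumnType.THETA_SKETCH);

  /**
   * Throws if any group-by key is unknown or is a complex metric column;
   * returns normally otherwise.
   */
  static void checkGroupBy(Map<String, ColumnType> schema, List<String> groupKeys) {
    for (String key : groupKeys) {
      ColumnType type = schema.get(key);
      if (type == null) {
        throw new IllegalArgumentException("Unknown column: " + key);
      }
      if (COMPLEX_METRICS.contains(type)) {
        throw new UnsupportedOperationException(
            "Cannot GROUP BY column \"" + key + "\" of type " + type);
      }
    }
  }
}
```

With a check like this, grouping by "brand_name" would pass, while grouping by 
"user_unique" would fail up front with a readable message rather than with the 
{{IllegalStateException}} above.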

Furthermore, the following sum query 

{code:sql}
SELECT SUM("user_unique") FROM "foodmart";
{code}

will fail because the {{SUM}} function expects a column of type {{NUMERIC}}, 
but "user_unique" is of type {{VARBINARY}}. This behavior is correct, and I'll 
add some test cases for it. 

Lastly, addressing filters. Conjunctions of filters work fine when they are 
pushed into Druid, as in the simple case below:

{code:sql}
SELECT COUNT(DISTINCT "user_unique")
FROM "foodmart"
WHERE "the_month" = 'April' AND "store_city" = 'Seattle';
{code}

The potential issue I see is when a filter cannot be pushed into Druid, for 
example when filtering by another metric. In those cases I'm a little unclear 
on what the behavior should be, since Calcite will be handling the returned 
thetaSketch/hyperUnique objects directly. 

Ex:

{code:sql}
SELECT COUNT(DISTINCT "user_unique") FROM "foodmart" WHERE "store_sales" > 0;
{code}

Calcite will retrieve raw Druid rows and then internally do a count distinct on 
the “user_unique” column.
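
The two cases can be told apart with a check along these lines. This is only 
a sketch under a simplifying assumption, namely that a filter is pushable 
exactly when it references no metric column; the adapter's actual rules may 
differ. The {{FilterPushdownCheck}} class and both method names are 
hypothetical.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical helper; names are illustrative, not Calcite API. */
class FilterPushdownCheck {
  /**
   * Simplifying assumption: a filter can be pushed into Druid only if it
   * references no metric column.
   */
  static boolean canPushDown(Set<String> filterColumns, Set<String> metricColumns) {
    return Collections.disjoint(filterColumns, metricColumns);
  }

  /**
   * True when the filter stays in Calcite and the COUNT(DISTINCT ...) argument
   * is a sketch column, i.e. the case whose behavior is still unclear.
   */
  static boolean isUnclearCase(Set<String> filterColumns, Set<String> metricColumns,
      String distinctColumn, Set<String> sketchColumns) {
    return !canPushDown(filterColumns, metricColumns)
        && sketchColumns.contains(distinctColumn);
  }
}
```

Under that assumption, the "the_month"/"store_city" filter is pushable, while 
the "store_sales" > 0 filter combined with COUNT(DISTINCT "user_unique") falls 
into the unclear case.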

> thetaSketch Support for Druid Adapter
> -------------------------------------
>
>                 Key: CALCITE-1787
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1787
>             Project: Calcite
>          Issue Type: New Feature
>          Components: druid
>    Affects Versions: 1.12.0
>            Reporter: Zain Humayun
>            Assignee: Zain Humayun
>            Priority: Minor
>
> Currently, the Druid adapter does not support the 
> [thetaSketch|http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html]
>  aggregate type, which is used to measure the cardinality of a column 
> quickly. Many Druid instances support theta sketches, so I think it would be 
> a nice feature to have.
> I've been looking at the Druid adapter, and propose we add a new DruidType 
> called {{thetaSketch}} and then add logic in the {{getJsonAggregation}} 
> method in class {{DruidQuery}} to generate the {{thetaSketch}} aggregate. 
> This will require accessing information about the columns (what data type 
> they are) so that the thetaSketch aggregate is only produced if the column's 
> type is {{thetaSketch}}. 
> Also, I've noticed that a {{hyperUnique}} DruidType is currently defined, but 
> a {{hyperUnique}} aggregate is never produced. Since both are approximate 
> aggregators, I could also couple in the logic for {{hyperUnique}}.
> I'd love to hear your thoughts on my approach, and any suggestions you have 
> for this feature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
