I looked at some of the Drill rule injection code (e.g.,
MongoPushDownFilterForScan and related handling?), but that uses the
onMatch -> transformTo style. My current implementation is based on the
MongoDB adapter, with ConverterRule classes for converting the Logical*
rels, and I didn't really grasp how those two styles of
match/transform and conversion relate.
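For reference, here is my rough understanding of the two styles, as a sketch against Calcite's rule API (not standalone-runnable; DruidProject, DruidRel.CONVENTION, and the push-down rule are names from my adapter or hypothetical, and the actual rewrite body is elided):

```java
// Style 1: ConverterRule (the MongoDB adapter style) -- convert()
// returns the replacement rel and the planner registers it as
// equivalent to the matched Logical* rel.
class DruidProjectRule extends ConverterRule {
  DruidProjectRule() {
    super(LogicalProject.class, Convention.NONE, DruidRel.CONVENTION,
        "DruidProjectRule");
  }
  @Override public RelNode convert(RelNode rel) {
    LogicalProject project = (LogicalProject) rel;
    return new DruidProject(project.getCluster(),
        project.getTraitSet().replace(DruidRel.CONVENTION),
        convert(project.getInput(), DruidRel.CONVENTION),
        project.getProjects(), project.getRowType());
  }
}

// Style 2: plain RelOptRule (the Drill push-down style) -- onMatch()
// inspects the matched pattern of rels and hands the planner an
// explicit rewrite via transformTo().
class DruidFilterPushDownRule extends RelOptRule {
  DruidFilterPushDownRule() {
    super(operand(LogicalFilter.class, operand(TableScan.class, none())));
  }
  @Override public void onMatch(RelOptRuleCall call) {
    LogicalFilter filter = call.rel(0);
    TableScan scan = call.rel(1);
    // ...build a scan that carries the filter condition, then:
    // call.transformTo(scanWithFilter);
  }
}
```

As far as I can tell a ConverterRule is just a specialized RelOptRule whose pattern is "one rel of the source convention," so the two styles should compose in the same planner.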

I ended up defining a TIME_BUCKET ScalarFunction in my DruidSchema class,
and I'm catching the TIME_BUCKET expression in my DruidProject class's
implement() function; this seems to work for a basic query:

"SELECT TIME_BUCKET(`time`, 'PT1H', 'Etc/UTC'), page, SUM(`count`) AS Edits
FROM druid.wikipedia WHERE '2015-10-01T00:00:00' <= `time` AND `time` <
'2015-10-28T00:00:00' GROUP BY TIME_BUCKET(`time`, 'PT1H', 'Etc/UTC'), page
ORDER BY Edits DESC LIMIT 8"
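Concretely, the registration looks roughly like this (a sketch: TimeBucketFunction and its eval() are my own names; ScalarFunctionImpl and AbstractSchema.getFunctionMultimap() are Calcite's; this isn't runnable without Calcite and Guava on the classpath):

```java
// The function body is a placeholder: the real bucketing is done by
// Druid via "granularity", so implement() translates the call rather
// than ever evaluating eval() locally.
public class TimeBucketFunction {
  public long eval(long time, String period, String timeZone) {
    return time; // placeholder -- never evaluated locally
  }
}

// In DruidSchema (which extends AbstractSchema):
@Override protected Multimap<String, Function> getFunctionMultimap() {
  return ImmutableMultimap.of("TIME_BUCKET",
      ScalarFunctionImpl.create(TimeBucketFunction.class, "eval"));
}
```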

I wasn't able to get the aliasing working in the second SELECT example you
provided, e.g.

"SELECT TIME_BUCKET(`time`, 'PT1H', 'Etc/UTC') AS tbk, page, SUM(`count`)
AS Edits FROM druid.wikipedia WHERE '2015-10-01T00:00:00' <= `time` AND
`time` < '2015-10-28T00:00:00' GROUP BY tbk, page ORDER BY Edits DESC LIMIT
8"

results in:

org.apache.calcite.sql.validate.SqlValidatorException: Column 'tbk' not
found in any table


--------------------------------


Does anyone have any thoughts on my first question about handling nested
GroupBys?

e.g., when I feed in this SQL:

"SELECT `language`, SUM(Edits) as EditSum FROM (SELECT `language`, page,
SUM(`count`) AS Edits FROM druid.wikipedia WHERE '2015-10-01T00:00:00' <=
`time` AND `time` < '2015-10-28T00:00:00' GROUP BY `language`, page ORDER
BY Edits DESC LIMIT 12) GROUP BY `language` ORDER BY EditSum DESC LIMIT 6"

I get the following plan explanation after the Logical* -> Druid* converter
rules fire:

DruidToEnumerableConverter.ENUMERABLE.[1 DESC](
input=rel#1409:DruidSort.DRUID.[1 DESC](
input=rel#1407:DruidAggregate.DRUID.[](
input=rel#1405:DruidSort.DRUID.[2 DESC](
input=rel#1403:DruidAggregate.DRUID.[](
input=rel#1401:DruidFilter.DRUID.[](
input=rel#591:DruidTableScan.DRUID.[](table=[druid, wikipedia]),
condition=AND(<=('2015-10-01T00:00:00', $12), <($12,
'2015-10-28T00:00:00'))),
group={0, 1}, Edits=SUM($13)),
sort0=$2,dir0=DESC,fetch=12),
group={1}, EditSum=SUM($2)),
sort0=$1,dir0=DESC,fetch=6))

What logic would I use to determine where the inner SELECT starts in this
plan? Would I check the input expression in my DruidAggregate
implementation?

If so, how would I know that these two expressions are at different
depths:

input=rel#1407:DruidAggregate.DRUID.[](
input=rel#1405:DruidSort.DRUID.[2 DESC](

while these two expressions are both at the inner level in the query?

input=rel#1403:DruidAggregate.DRUID.[](
input=rel#1401:DruidFilter.DRUID.[](
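My current mental model is that the plan is just a tree rooted at the converter, so "depth" falls out of walking each node's input chain; the inner SELECT starts at the second DruidAggregate encountered while descending. A self-contained toy version of that walk (no Calcite classes; Rel here is a stand-in for a single-input RelNode, where real code would recurse over getInputs()):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a single-input RelNode, matching the linear plan
// in the EXPLAIN output above.
final class Rel {
  final String name;
  final Rel input;          // null at the leaf (the table scan)
  Rel(String name, Rel input) { this.name = name; this.input = input; }
}

public class PlanDepth {
  // Walk from the root, recording each node's name and depth.
  static List<String> depths(Rel root) {
    List<String> out = new ArrayList<>();
    int depth = 0;
    for (Rel r = root; r != null; r = r.input) {
      out.add(r.name + "@" + depth);
      depth++;
    }
    return out;
  }

  public static void main(String[] args) {
    Rel scan   = new Rel("DruidTableScan", null);
    Rel filter = new Rel("DruidFilter", scan);
    Rel agg1   = new Rel("DruidAggregate", filter);   // inner GROUP BY
    Rel sort1  = new Rel("DruidSort", agg1);
    Rel agg2   = new Rel("DruidAggregate", sort1);    // outer GROUP BY
    Rel sort2  = new Rel("DruidSort", agg2);
    System.out.println(depths(sort2));
    // prints [DruidSort@0, DruidAggregate@1, DruidSort@2,
    //         DruidAggregate@3, DruidFilter@4, DruidTableScan@5]
  }
}
```

So DruidAggregate#1407 and DruidSort#1405 sit at consecutive depths (parent and child), while DruidAggregate#1403 -> DruidFilter#1401 are simply the same parent/child pattern repeated deeper in the tree, below the outer aggregate. What I don't know is whether checking "my input is another DruidAggregate chain" from inside each rel's implementation is the idiomatic approach, or whether this belongs in a rule.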

On Sat, Oct 17, 2015 at 2:39 PM, Ted Dunning <[email protected]> wrote:

> So you really just need a rule that recognizes the TIME_BUCKET function
> inside the GROUP BY, right?
>
> And then probably you need to make sure that this function is propagated
> correctly (this may be automatic with Calcite) to make the following
> equivalent:
>
>
> SELECT
> SUM(tweet_length) as TotalTweetLength
> FROM twitterstream
> GROUP BY TIME_BUCKET(__time, PT1H, 'Etc/UTC')
>
> SELECT
> SUM(tweet_length) as TotalTweetLength,
> TIME_BUCKET(__time, PT1H, 'Etc/UTC') as t
> FROM twitterstream
> GROUP BY t
>
>
> This might arise from nested queries or such and there are probably severe
> limits on how Druid can handle these buckets which will complicate your
> life.
>
> It might also be very helpful if you look into how Drill allows data
> sources to inject rules into the query optimizer. That is usually used to
> express what kinds of push-down a function can accept and this seems to be
> exactly such a case.
>
> For instance, presumably, you would like to handle cases like this:
>
> SELECT
> SUM(tweet_length) as TotalTweetLength
> FROM twitterstream
> GROUP BY TIME_BUCKET(__time, PT1H, 'Etc/UTC'), floor(TotalTweetLength/10)
>
> (assuming Druid can do this)
>
>
>
>
>
> On Fri, Oct 16, 2015 at 10:48 PM, Jonathan Wei <[email protected]> wrote:
>
> > The time bucketing I have in mind is a feature supported by an existing
> SQL
> > client for Druid, it would be used with GROUP BY:
> >
> > https://github.com/implydata/plyql
> >
> > An example of it would be:
> >
> > plyql -h 10.20.30.40 -i P1D -q "
> > SELECT
> > SUM(tweet_length) as TotalTweetLength
> > FROM twitterstream
> > GROUP BY TIME_BUCKET(__time, PT1H, 'Etc/UTC')
> > "
> >
> > The "GROUP BY TIME_BUCKET(__time, PT1H, 'Etc/UTC')" would be equivalent
> to
> > specifying the following within the Druid query JSON:
> >
> > ...
> > "granularity": {"type": "period", "period": "PT1H", "timeZone":
> "Etc/UTC"}
> > ...
> >
> >
> > On Fri, Oct 16, 2015 at 6:13 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > On Fri, Oct 16, 2015 at 4:33 PM, Jonathan Wei <[email protected]>
> wrote:
> > >
> > > > 2.) I would like to extend the SQL syntax with a new function for
> > > > specifying the desired time bucketing properties for a Druid query,
> to
> > be
> > > > translated into "granularity":
> > > >
> > > > http://druid.io/docs/0.8.1/querying/granularities.html
> > > >
> > > > Can someone point me to a good resource or example for doing this in
> > > > Calcite?
> > > >
> > >
> > > Why is this not a group by operation?
> > >
> >
>
