[DISCUSS] Some thoughts about the structure of the Druid Adapter

Zain Humayun Fri, 21 Jul 2017 16:14:03 -0700

Hello Calcite Community,



Over the past few months I’ve familiarized myself with the Druid adapter in 
Calcite, and I’d like to start a discussion about its general structure. For 
those unfamiliar, Druid supports a handful of query types[1]. Each has a 
specialized use case, different trade-offs, different fields, etc. At the 
moment, DruidQuery.java acts as the representation for all the different kinds 
of queries (Select, TopN, GroupBy, Timeseries). This works well if the queries 
are simple, but as we try to push more and more into Druid, the drawbacks of 
this approach start to show.



For one, when it comes time to actually issue the Druid query, a great deal of 
logic is required to determine which kind of Druid query to generate (see 
DruidQuery#getQuery). Another issue arises in the rules that push RelNodes into 
Druid. Certain rules need to check whether pushing a RelNode results in a 
DruidQuery whose query type is valid. A good example of this is DruidSortRule. 
Again, more logic is needed to determine which kind of DruidQuery will be 
produced. This becomes a problem when one needs to extend a rule, or needs to 
do the same check in their own rule. Lastly, I think that the current adapter 
structure makes it harder for newcomers to understand how the adapter works, 
and harder to add features to it. All of these problems get worse if we decide 
to support another (existing or future) Druid query type later on.



With all that said, it would seem natural to represent each query type as its 
own RelNode (DruidSelectQuery, DruidTopNQuery, etc). DruidQuery can serve as an 
abstract base class that contains a rule to transform itself into the different 
kinds of Druid query RelNodes. Each query type will contain its own set of 
specialized rules to push RelNodes into it, a tailored cost function, and its 
own query generation logic. The VolcanoPlanner takes care of the rest. 
Whichever Druid query achieves the lowest cost by pushing in the most RelNodes 
will be chosen and executed.
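
To make the idea a bit more concrete, here is a rough, self-contained sketch of 
the shape I have in mind. It does not use any Calcite classes. Every name in it 
(QueryNode, Op, the per-type subclasses), the sets of supported operators, and 
the cost numbers are made up purely for illustration, and the loop in 
cheapest() only stands in for the choice the planner would make.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class DruidQuerySketch {
  /** Hypothetical stand-ins for the RelNodes a rule might push into Druid. */
  enum Op { FILTER, PROJECT, AGGREGATE, SORT, LIMIT }

  /** Hypothetical abstract base; in the proposal DruidQuery would play this role. */
  abstract static class QueryNode {
    final String queryType;
    final Set<Op> supported;  // operators this query type can absorb (assumed)
    final double baseCost;    // assumed relative cost of the query type itself

    QueryNode(String queryType, Set<Op> supported, double baseCost) {
      this.queryType = queryType;
      this.supported = supported;
      this.baseCost = baseCost;
    }

    /** Counts the leading operators of the plan that can be pushed in. */
    int pushed(List<Op> plan) {
      int n = 0;
      for (Op op : plan) {
        if (!supported.contains(op)) {
          break;
        }
        n++;
      }
      return n;
    }

    /** Toy cost function: every operator left outside Druid costs 10. */
    double cost(List<Op> plan) {
      return baseCost + 10.0 * (plan.size() - pushed(plan));
    }
  }

  // One subclass per Druid query type; the supported sets are illustrative only.
  static final class SelectNode extends QueryNode {
    SelectNode() { super("select", EnumSet.of(Op.FILTER, Op.PROJECT), 1.0); }
  }

  static final class TimeseriesNode extends QueryNode {
    TimeseriesNode() { super("timeseries", EnumSet.of(Op.FILTER, Op.AGGREGATE), 1.0); }
  }

  static final class TopNNode extends QueryNode {
    TopNNode() {
      super("topN", EnumSet.of(Op.FILTER, Op.AGGREGATE, Op.SORT, Op.LIMIT), 2.0);
    }
  }

  static final class GroupByNode extends QueryNode {
    GroupByNode() { super("groupBy", EnumSet.allOf(Op.class), 3.0); }
  }

  /** Stand-in for the planner: returns the candidate with the lowest toy cost. */
  static QueryNode cheapest(List<Op> plan) {
    List<QueryNode> candidates = Arrays.asList(
        new SelectNode(), new TimeseriesNode(), new TopNNode(), new GroupByNode());
    QueryNode best = candidates.get(0);
    for (QueryNode candidate : candidates) {
      if (candidate.cost(plan) < best.cost(plan)) {
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(
        cheapest(Arrays.asList(Op.FILTER, Op.AGGREGATE, Op.SORT, Op.LIMIT)).queryType);
    System.out.println(cheapest(Arrays.asList(Op.FILTER, Op.PROJECT)).queryType);
  }
}
```

With these made-up numbers, the first plan resolves to topN and the second to 
select. The point is only that no single class has to contain the logic for 
deciding the query type; each candidate states what it can absorb and what it 
costs, and the cheapest one wins.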



This, of course, would be a very large refactor, but I think it would be 
beneficial in the long run. There’s at least one open JIRA ticket 
(CALCITE-1206) that would benefit from this change.



Anyway, I’d be interested to hear the community’s thoughts on this kind of 
change.



Thanks!


Zain   


[1] http://druid.io/docs/0.10.0/querying/querying.html
