In the world where druid-sql is where Druid's Calcite API lives, what do you think would make the most sense for the current calcite-druid module? Would it make sense to remove it (and merge anything it does, that druid-sql doesn't already do, into druid-sql) or to keep it in the Calcite project but have it be a thin wrapper over druid-sql?
I guess this should be informed by who the users of calcite-druid are. At this point, I don't know much beyond the fact that Hive uses it.

Gian

On Wed, Feb 7, 2018 at 10:29 AM, Julian Hyde <jh...@apache.org> wrote:

> I agree with you both.
>
> For a particular engine, such as Druid, there are often 3 options:
>
> 1. build a Calcite adapter to the engine's native query language;
>
> 2. if the engine supports SQL, connect to the engine via Calcite's JDBC adapter;
>
> 3. if the engine exposes an API based on Calcite algebra, connect to that API.
>
> All of those options are valid for Druid right now, and 3 (Gian's proposal) is likely to yield the best plans. As Gian correctly notes, that is likely to increase the coupling, but we can live with that. (If people want loose coupling they can talk to Druid via the JDBC adapter, and we just need to make sure that the Druid JDBC dialect knows that Druid cannot do joins.)
>
> Nishant's core point seems to be that we need some kind of bulk API/protocol to talk to Druid, to consume partial query results in parallel. This is desirable because Hive is -- how to put it politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and Drill would want a similar API/protocol. When it exists, we can generate a hybrid plan: Druid physical algebra that generates partial results in parallel underneath Hive physical algebra that consumes those results in parallel.
>
> The same pattern occurred in Phoenix. Phoenix does not have shuffle/exchange capabilities, so for big analytic queries we would want to couple it with Hive/Spark/Presto/Drill. We talked about Drillix (Drill + Phoenix) for a while but never completed it.
>
> Julian
>
> On Wed, Feb 7, 2018 at 9:07 AM, Nishant Bangarwa <nishant.mon...@gmail.com> wrote:
>
> > Having a focused effort into a single project would be great and would definitely help us in evolving druid sql capabilities faster.
> >
> > 1) One more thing that we need to consider here is that calcite druid-adapter is also used in Apache Hive where we use the druid rules to generate an optimized plan and then the druid query is executed from druid containers. In druid-sql I believe the query execution logic is tied to the fact that execution node is a druid-broker where native queries can be run to generate a Sequence of results. We might need some rework there to ensure that things work fine with hive too after proposed changes.
> >
> > 2) druid-sql dependencies can probably be reduced by separating the planning and execution logic in druid-sql, the planning logic need not depend on lots of druid code and can have light-weight dependencies while the execution part and result serde which pulls in lots of druid dependencies can reside in separate module and calcite druid-adapter need not depend on that module.
> >
> > I think, the hypothetical case you mentioned is also worth considering, to ease up the development process, we can consider moving calcite-druid as a module in druid, so that we make release of both druid-sql and calcite-adapter together.
> >
> > On Wed, 7 Feb 2018 at 09:02 Gian Merlino <g...@imply.io> wrote:
> >
> >> Hi Calcites,
> >>
> >> I would like to raise the idea of adding druid-sql (http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar) as a dependency in Calcite's Druid adapter. It should reduce the size of calcite-druid substantially, since it would mostly just be calling into druid-sql.
> >>
> >> This has some advantages for both projects.
> >>
> >> 1) Support for new Druid features often appears in Druid SQL first. By embedding druid-sql, Calcite gets these new features too, without extra work.
> >> For example https://issues.apache.org/jira/browse/CALCITE-2170 is an outstanding jira to add support for Druid expressions to Calcite, but druid-sql already supports these. In fact it looks like some of the code in the proposed patch is copied from druid-sql. As another example, https://issues.apache.org/jira/browse/CALCITE-2077 switched table scans from "select" to "scan", which had been previously done in Druid SQL in https://github.com/druid-io/druid/pull/4751.
> >>
> >> 2) Depending on druid-sql means Calcite doesn't need to implement its own Druid query and result serde code. Druid already has it.
> >>
> >> 3) Focused effort on a single module rather than the split effort that we have today, where some developers are contributing to druid-sql and some are contributing to calcite-druid.
> >>
> >> 4) More test coverage for both projects, presumably.
> >>
> >> I think (3) and (4) especially would give us the opportunity to improve both projects much more rapidly.
> >>
> >> However, there are also some possible disadvantages.
> >>
> >> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of other Druid code. Calcite users may prefer a lighter weight module.
> >>
> >> 2) druid-sql's APIs are not intended to be stable, and probably never will be. They may break on minor releases. So updating the version of druid-sql in Calcite may involve tweaking how functions are called, etc. I think this effort should be minimal if calcite-druid is mostly just delegating to druid-sql.
> >>
> >> 3) druid-sql depends on calcite-core. This should usually be fine, but it means that if calcite-core has a breaking change, then calcite-druid cannot update its version of druid-sql until druid-sql first updates its version of calcite-core.
> >>
> >> Despite these potential difficulties, I think the potential benefit means this is worth exploring.
> >>
> >> Finally: a hypothetical. Why not do the other way around -- have Druid add calcite-druid as a dependency? The main reason is that this makes the Druid development process awkward when a new Druid SQL feature also requires a new native query feature. Today, we develop the native query and SQL sides together. If Druid depended on calcite-druid, then we would need to develop the native query side first, then release it, then update Calcite's Druid adapter, then pull that back into Druid. Generally, just adding an extra rule in druid-sql wouldn't be enough, since the sorts of changes we are making at this point are typically more extensive than just adjusting rules.
> >>
> >> Gian