So, it sounds like the first step is to see whether the Hive folks are open to using druid-sql instead of calcite-druid. What would be the best way to go about that? Nishant, do you think you could help?

Gian

On Wed, Feb 7, 2018 at 3:46 PM, Julian Hyde <[email protected]> wrote:

> Long term there doesn’t seem to be any point keeping Calcite’s druid
> adapter around. The code would be an inferior duplicate of druid-sql,
> so we would want to
>
> But shorter term there will be quite a few things that Hive needs that
> will only exist in Calcite’s druid adapter. The challenge will be the
> transition. You will need to convince the Hive developers that the move
> is worthwhile. (It will help if you can point to some quick benefits to
> making the transition.)
>
> Julian
>
>
> > On Feb 7, 2018, at 2:59 PM, Gian Merlino <[email protected]> wrote:
> >
> > In the world where druid-sql is where Druid's Calcite API lives, what
> > do you think would make the most sense for the current calcite-druid
> > module? Would it make sense to remove it (and merge anything it does,
> > that druid-sql doesn't already do, into druid-sql) or to keep it in
> > the Calcite project but have it be a thin wrapper over druid-sql?
> >
> > I guess this should be informed by who the users of calcite-druid
> > are. At this point, I don't know much beyond the fact that Hive uses
> > it.
> >
> > Gian
> >
> > On Wed, Feb 7, 2018 at 10:29 AM, Julian Hyde <[email protected]> wrote:
> >
> >> I agree with you both.
> >>
> >> For a particular engine, such as Druid, there are often 3 options:
> >>
> >> 1. build a Calcite adapter to the engine's native query language;
> >>
> >> 2. if the engine supports SQL, connect to the engine via Calcite's
> >> JDBC adapter;
> >>
> >> 3. if the engine exposes an API based on Calcite algebra, connect to
> >> that API.
> >>
> >> All of those options are valid for Druid right now, and 3 (Gian's
> >> proposal) is likely to yield the best plans. As Gian correctly notes,
> >> that is likely to increase the coupling, but we can live with that.
> >> (If people want loose coupling they can talk to Druid via the JDBC
> >> adapter, and we just need to make sure that the Druid JDBC dialect
> >> knows that Druid cannot do joins.)
> >>
> >> Nishant's core point seems to be that we need some kind of bulk
> >> API/protocol to talk to Druid, to consume partial query results in
> >> parallel. This is desirable because Hive is -- how to put it
> >> politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and
> >> Drill would want a similar API/protocol. When it exists, we can
> >> generate a hybrid plan: Druid physical algebra that generates partial
> >> results in parallel underneath Hive physical algebra that consumes
> >> those results in parallel.
> >>
> >> The same pattern occurred in Phoenix. Phoenix does not have
> >> shuffle/exchange capabilities, so for big analytic queries we would
> >> want to couple it with Hive/Spark/Presto/Drill. We talked about
> >> Drillix (Drill + Phoenix) for a while but never completed it.
> >>
> >> Julian
> >>
> >>
> >> On Wed, Feb 7, 2018 at 9:07 AM, Nishant Bangarwa
> >> <[email protected]> wrote:
> >>> Having a focused effort into a single project would be great and
> >>> would definitely help us in evolving druid sql capabilities faster.
> >>>
> >>> 1) One more thing that we need to consider here is that calcite
> >>> druid-adapter is also used in Apache Hive where we use the druid
> >>> rules to generate an optimized plan and then the druid query is
> >>> executed from druid containers. In druid-sql I believe the query
> >>> execution logic is tied to the fact that execution node is a
> >>> druid-broker where native queries can be run to generate a Sequence
> >>> of results. We might need some rework there to ensure that things
> >>> work fine with hive too after proposed changes.
> >>>
> >>> 2) druid-sql dependencies can probably be reduced by separating the
> >>> planning and execution logic in druid-sql, the planning logic need
> >>> not depend on lots of druid code and can have light-weight
> >>> dependencies while the execution part and result serde which pulls
> >>> in lots of druid dependencies can reside in separate module and
> >>> calcite druid-adapter need not depend on that module.
> >>>
> >>> I think, the hypothetical case you mentioned is also worth
> >>> considering, to ease up the development process, we can consider
> >>> moving calcite-druid as a module in druid, so that we make release
> >>> of both druid-sql and calcite-adapter together.
> >>>
> >>> On Wed, 7 Feb 2018 at 09:02 Gian Merlino <[email protected]> wrote:
> >>>
> >>>> Hi Calcites,
> >>>>
> >>>> I would like to raise the idea of adding druid-sql (
> >>>> http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar
> >>>> ) as a dependency in Calcite's Druid adapter. It should reduce the
> >>>> size of calcite-druid substantially, since it would mostly just be
> >>>> calling into druid-sql.
> >>>>
> >>>> This has some advantages for both projects.
> >>>>
> >>>> 1) Support for new Druid features often appears in Druid SQL first.
> >>>> By embedding druid-sql, Calcite gets these new features too,
> >>>> without extra work. For example
> >>>> https://issues.apache.org/jira/browse/CALCITE-2170 is an
> >>>> outstanding jira to add support for Druid expressions to Calcite,
> >>>> but druid-sql already supports these. In fact it looks like some of
> >>>> the code in the proposed patch is copied from druid-sql. As another
> >>>> example, https://issues.apache.org/jira/browse/CALCITE-2077
> >>>> switched table scans from "select" to "scan", which had been
> >>>> previously done in Druid SQL in
> >>>> https://github.com/druid-io/druid/pull/4751.
> >>>>
> >>>> 2) Depending on druid-sql means Calcite doesn't need to implement
> >>>> its own Druid query and result serde code. Druid already has it.
> >>>>
> >>>> 3) Focused effort on a single module rather than the split effort
> >>>> that we have today, where some developers are contributing to
> >>>> druid-sql and some are contributing to calcite-druid.
> >>>>
> >>>> 4) More test coverage for both projects, presumably.
> >>>>
> >>>> I think (3) and (4) especially would give us the opportunity to
> >>>> improve both projects much more rapidly.
> >>>>
> >>>> However, there are also some possible disadvantages.
> >>>>
> >>>> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of
> >>>> other Druid code. Calcite users may prefer a lighter weight module.
> >>>>
> >>>> 2) druid-sql's APIs are not intended to be stable, and probably
> >>>> never will be. They may break on minor releases. So updating the
> >>>> version of druid-sql in Calcite may involve tweaking how functions
> >>>> are called, etc. I think this effort should be minimal if
> >>>> calcite-druid is mostly just delegating to druid-sql.
> >>>>
> >>>> 3) druid-sql depends on calcite-core. This should usually be fine,
> >>>> but it means that if calcite-core has a breaking change, then
> >>>> calcite-druid cannot update its version of druid-sql until
> >>>> druid-sql first updates its version of calcite-core.
> >>>>
> >>>> Despite these potential difficulties, I think the potential benefit
> >>>> means this is worth exploring.
> >>>>
> >>>> Finally: a hypothetical. Why not do the other way around -- have
> >>>> Druid add calcite-druid as a dependency? The main reason is that
> >>>> this makes the Druid development process awkward when a new Druid
> >>>> SQL feature also requires a new native query feature. Today, we
> >>>> develop the native query and SQL sides together.
> >>>> If Druid depended on calcite-druid, then we would need to develop
> >>>> the native query side first, then release it, then update Calcite's
> >>>> Druid adapter, then pull that back into Druid. Generally, just
> >>>> adding an extra rule in druid-sql wouldn't be enough, since the
> >>>> sorts of changes we are making at this point are typically more
> >>>> extensive than just adjusting rules.
> >>>>
> >>>> Gian
