Re: Embed druid-sql inside Calcite?

Gian Merlino Wed, 07 Feb 2018 18:05:22 -0800

So, it sounds like the first thing to look at would be seeing if the Hive
folks are open to using druid-sql instead of calcite-druid. What'd be the
best way to go about that? Nishant, do you think you could help?


Gian

On Wed, Feb 7, 2018 at 3:46 PM, Julian Hyde <[email protected]> wrote:

> Long term there doesn’t seem to be any point keeping Calcite’s druid
> adapter around. The code would be an inferior duplicate of druid-sql, so we
> would want to
>
> But shorter term there will be quite a few things that Hive needs that
> will only exist in Calcite’s druid adapter. The challenge will be the
> transition. You will need to convince the Hive developers that the move is
> worthwhile. (It will help if you can point to some quick benefits to making
> the transition.)
>
> Julian
>
>
> > On Feb 7, 2018, at 2:59 PM, Gian Merlino <[email protected]> wrote:
> >
> > In the world where druid-sql is where Druid's Calcite API lives, what do
> > you think would make the most sense for the current calcite-druid module?
> > Would it make sense to remove it (and merge anything it does, that
> > druid-sql doesn't already do, into druid-sql) or to keep it in the Calcite
> > project but have it be a thin wrapper over druid-sql?
> >
> > I guess this should be informed by who the users of calcite-druid are. At
> > this point, I don't know much beyond the fact that Hive uses it.
> >
> > Gian
> >
> > On Wed, Feb 7, 2018 at 10:29 AM, Julian Hyde <[email protected]> wrote:
> >
> >> I agree with you both.
> >>
> >> For a particular engine, such as Druid, there are often 3 options:
> >>
> >> 1. build a Calcite adapter to the engine's native query language;
> >>
> >> 2. if the engine supports SQL, connect to the engine via Calcite's JDBC
> >> adapter;
> >>
> >> 3. if the engine exposes an API based on Calcite algebra, connect to that
> >> API.
> >>
> >> All of those options are valid for Druid right now, and 3 (Gian's
> >> proposal) is likely to yield the best plans. As Gian correctly notes,
> >> that is likely to increase the coupling, but we can live with that.
> >> (If people want loose coupling they can talk to Druid via the JDBC
> >> adapter, and we just need to make sure that the Druid JDBC dialect
> >> knows that Druid cannot do joins.)
> >>
> >> Nishant's core point seems to be that we need some kind of bulk
> >> API/protocol to talk to Druid, to consume partial query results in
> >> parallel. This is desirable because Hive is -- how to put it
> >> politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and
> >> Drill would want a similar API/protocol. When it exists, we can
> >> generate a hybrid plan: Druid physical algebra that generates partial
> >> results in parallel underneath Hive physical algebra that consumes
> >> those results in parallel.
> >>
> >> The same pattern occurred in Phoenix. Phoenix does not have
> >> shuffle/exchange capabilities, so for big analytic queries we would
> >> want to couple it with Hive/Spark/Presto/Drill. We talked about
> >> Drillix (Drill + Phoenix) for a while but never completed it.
> >>
> >> Julian
> >>
> >>
> >> On Wed, Feb 7, 2018 at 9:07 AM, Nishant Bangarwa
> >> <[email protected]> wrote:
> >>> Having a focused effort into a single project would be great and would
> >>> definitely help us in evolving druid sql capabilities faster.
> >>>
> >>> 1) One more thing that we need to consider here is that calcite
> >>> druid-adapter is also used in Apache Hive where we use the druid rules to
> >>> generate an optimized plan and then the druid query is executed from druid
> >>> containers. In druid-sql I believe the query execution logic is tied to the
> >>> fact that execution node is a druid-broker where native queries can be run
> >>> to generate a Sequence of results. We might need some rework there to
> >>> ensure that things work fine with hive too after proposed changes.
> >>>
> >>> 2) druid-sql dependencies can probably be reduced by separating the
> >>> planning and execution logic in druid-sql, the planning logic need not
> >>> depend on lots of druid code and can have light-weight dependencies while
> >>> the execution part and result serde which pulls in lots of druid
> >>> dependencies can reside in separate module and calcite druid-adapter need
> >>> not depend on that module.
> >>>
> >>> I think, the hypothetical case you mentioned is also worth considering, to
> >>> ease up the development process, we can consider moving calcite-druid as a
> >>> module in druid, so that we make release of both druid-sql and
> >>> calcite-adapter together.
> >>>
> >>> On Wed, 7 Feb 2018 at 09:02 Gian Merlino <[email protected]> wrote:
> >>>
> >>>> Hi Calcites,
> >>>>
> >>>> I would like to raise the idea of adding druid-sql (
> >>>>
> >>>> http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar
> >>>> )
> >>>> as a dependency in Calcite's Druid adapter. It should reduce the size of
> >>>> calcite-druid substantially, since it would mostly just be calling into
> >>>> druid-sql.
> >>>>
> >>>> This has some advantages for both projects.
> >>>>
> >>>> 1) Support for new Druid features often appears in Druid SQL first. By
> >>>> embedding druid-sql, Calcite gets these new features too, without extra
> >>>> work. For example https://issues.apache.org/jira/browse/CALCITE-2170 is an
> >>>> outstanding jira to add support for Druid expressions to Calcite, but
> >>>> druid-sql already supports these. In fact it looks like some of the code in
> >>>> the proposed patch is copied from druid-sql. As another example,
> >>>> https://issues.apache.org/jira/browse/CALCITE-2077 switched table scans
> >>>> from "select" to "scan", which had been previously done in Druid SQL in
> >>>> https://github.com/druid-io/druid/pull/4751.
> >>>>
> >>>> 2) Depending on druid-sql means Calcite doesn't need to implement its own
> >>>> Druid query and result serde code. Druid already has it.
> >>>>
> >>>> 3) Focused effort on a single module rather than the split effort that we
> >>>> have today, where some developers are contributing to druid-sql and some
> >>>> are contributing to calcite-druid.
> >>>>
> >>>> 4) More test coverage for both projects, presumably.
> >>>>
> >>>> I think (3) and (4) especially would give us the opportunity to improve
> >>>> both projects much more rapidly.
> >>>>
> >>>> However, there are also some possible disadvantages.
> >>>>
> >>>> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of other
> >>>> Druid code. Calcite users may prefer a lighter weight module.
> >>>>
> >>>> 2) druid-sql's APIs are not intended to be stable, and probably never will
> >>>> be. They may break on minor releases. So updating the version of druid-sql
> >>>> in Calcite may involve tweaking how functions are called, etc. I think this
> >>>> effort should be minimal if calcite-druid is mostly just delegating to
> >>>> druid-sql.
> >>>>
> >>>> 3) druid-sql depends on calcite-core. This should usually be fine, but it
> >>>> means that if calcite-core has a breaking change, then calcite-druid cannot
> >>>> update its version of druid-sql until druid-sql first updates its version
> >>>> of calcite-core.
> >>>>
> >>>> Despite these potential difficulties, I think the potential benefit means
> >>>> this is worth exploring.
> >>>>
> >>>> Finally: a hypothetical. Why not do the other way around -- have Druid add
> >>>> calcite-druid as a dependency? The main reason is that this makes the Druid
> >>>> development process awkward when a new Druid SQL feature also requires a
> >>>> new native query feature. Today, we develop the native query and SQL sides
> >>>> together. If Druid depended on calcite-druid, then we would need to develop
> >>>> the native query side first, then release it, then update Calcite's Druid
> >>>> adapter, then pull that back into Druid. Generally, just adding an extra
> >>>> rule in druid-sql wouldn't be enough, since the sorts of changes we are
> >>>> making at this point are typically more extensive than just adjusting
> >>>> rules.
> >>>>
> >>>> Gian
> >>>>
> >>
>
>
