Re: Embed druid-sql inside Calcite?
So, it sounds like the first thing to look at would be seeing if the Hive folks are open to using druid-sql instead of calcite-druid. What'd be the best way to go about that? Nishant- do you think you could help?

Gian
Re: Embed druid-sql inside Calcite?
Long term there doesn’t seem to be any point keeping Calcite’s druid adapter around. The code would be an inferior duplicate of druid-sql, so we would want to

But shorter term there will be quite a few things that Hive needs that will only exist in Calcite’s druid adapter. The challenge will be the transition. You will need to convince the Hive developers that the move is worthwhile. (It will help if you can point to some quick benefits to making the transition.)

Julian
Re: Embed druid-sql inside Calcite?
In the world where druid-sql is where Druid's Calcite API lives, what do you think would make the most sense for the current calcite-druid module? Would it make sense to remove it (and merge anything it does, that druid-sql doesn't already do, into druid-sql) or to keep it in the Calcite project but have it be a thin wrapper over druid-sql?

I guess this should be informed by who the users of calcite-druid are. At this point, I don't know much beyond the fact that Hive uses it.

Gian
Re: Embed druid-sql inside Calcite?
I agree with you both.

For a particular engine, such as Druid, there are often 3 options:

1. build a Calcite adapter to the engine's native query language;

2. if the engine supports SQL, connect to the engine via Calcite's JDBC adapter;

3. if the engine exposes an API based on Calcite algebra, connect to that API.

All of those options are valid for Druid right now, and 3 (Gian's proposal) is likely to yield the best plans. As Gian correctly notes, that is likely to increase the coupling, but we can live with that. (If people want loose coupling they can talk to Druid via the JDBC adapter, and we just need to make sure that the Druid JDBC dialect knows that Druid cannot do joins.)

Nishant's core point seems to be that we need some kind of bulk API/protocol to talk to Druid, to consume partial query results in parallel. This is desirable because Hive is -- how to put it politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and Drill would want a similar API/protocol. When it exists, we can generate a hybrid plan: Druid physical algebra that generates partial results in parallel underneath Hive physical algebra that consumes those results in parallel.

The same pattern occurred in Phoenix. Phoenix does not have shuffle/exchange capabilities, so for big analytic queries we would want to couple it with Hive/Spark/Presto/Drill. We talked about Drillix (Drill + Phoenix) for a while but never completed it.

Julian
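To make the hybrid plan Julian describes a little more concrete (partial results produced in parallel on the Druid side, consumed in parallel by a larger engine), here is a minimal Java sketch. Nothing in it is a Druid, Hive or Calcite API: each `Supplier` is a hypothetical stand-in for one stream of partial aggregates, and `mergeInParallel` is an invented name for the consuming side. It only shows the shape of the pattern, not how the bulk API/protocol would actually look.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

final class HybridPlanSketch {

    /**
     * Fetches every partial result on its own thread, then sums counts per key.
     * Each Supplier is a hypothetical placeholder for one parallel partial-result stream.
     */
    static Map<String, Long> mergeInParallel(List<Supplier<Map<String, Long>>> partials) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partials.size()));
        try {
            // Submit all fetches first so they run concurrently.
            List<Future<Map<String, Long>>> pending = new ArrayList<>();
            for (Supplier<Map<String, Long>> partial : partials) {
                Callable<Map<String, Long>> task = partial::get;
                pending.add(pool.submit(task));
            }
            // Final aggregation step, done by the consuming engine.
            Map<String, Long> merged = new TreeMap<>();
            for (Future<Map<String, Long>> future : pending) {
                for (Map.Entry<String, Long> e : future.get().entrySet()) {
                    merged.merge(e.getKey(), e.getValue(), Long::sum);
                }
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Three made-up partial results, as if from three parallel producers.
        List<Supplier<Map<String, Long>>> partials = new ArrayList<>();
        partials.add(() -> Collections.singletonMap("us", 3L));
        partials.add(() -> Collections.singletonMap("eu", 1L));
        partials.add(() -> Collections.singletonMap("us", 2L));
        System.out.println(mergeInParallel(partials)); // prints {eu=1, us=5}
    }
}
```

The consuming side never needs a single merged stream from a broker; it only needs each partial result to be mergeable, which is why a count (summable) is used in this toy example.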
Re: Embed druid-sql inside Calcite?
I think druid-sql could support the Hive use case without too much reworking. It has a method that returns a Sequence: public abstract Sequence
Re: Embed druid-sql inside Calcite?
Having a focused effort into a single project would be great and would definitely help us in evolving druid sql capabilities faster.

1) One more thing that we need to consider here is that calcite druid-adapter is also used in Apache Hive where we use the druid rules to generate an optimized plan and then the druid query is executed from druid containers. In druid-sql I believe the query execution logic is tied to the fact that execution node is a druid-broker where native queries can be run to generate a Sequence of results. We might need some rework there to ensure that things work fine with hive too after proposed changes.

2) druid-sql dependencies can probably be reduced by separating the planning and execution logic in druid-sql, the planning logic need not depend on lots of druid code and can have light-weight dependencies while the execution part and result serde which pulls in lots of druid dependencies can reside in separate module and calcite druid-adapter need not depend on that module.

I think, the hypothetical case you mentioned is also worth considering, to ease up the development process, we can consider moving calcite-druid as a module in druid, so that we make release of both druid-sql and calcite-adapter together.

On Wed, 7 Feb 2018 at 09:02 Gian Merlino wrote:

> Hi Calcites,
>
> I would like to raise the idea of adding druid-sql (http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar) as a dependency in Calcite's Druid adapter. It should reduce the size of calcite-druid substantially, since it would mostly just be calling into druid-sql.
>
> This has some advantages for both projects.
>
> 1) Support for new Druid features often appears in Druid SQL first. By embedding druid-sql, Calcite gets these new features too, without extra work. For example https://issues.apache.org/jira/browse/CALCITE-2170 is an outstanding jira to add support for Druid expressions to Calcite, but druid-sql already supports these. In fact it looks like some of the code in the proposed patch is copied from druid-sql. As another example, https://issues.apache.org/jira/browse/CALCITE-2077 switched table scans from "select" to "scan", which had been previously done in Druid SQL in https://github.com/druid-io/druid/pull/4751.
>
> 2) Depending on druid-sql means Calcite doesn't need to implement its own Druid query and result serde code. Druid already has it.
>
> 3) Focused effort on a single module rather than the split effort that we have today, where some developers are contributing to druid-sql and some are contributing to calcite-druid.
>
> 4) More test coverage for both projects, presumably.
>
> I think (3) and (4) especially would give us the opportunity to improve both projects much more rapidly.
>
> However, there are also some possible disadvantages.
>
> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of other Druid code. Calcite users may prefer a lighter weight module.
>
> 2) druid-sql's APIs are not intended to be stable, and probably never will be. They may break on minor releases. So updating the version of druid-sql in Calcite may involve tweaking how functions are called, etc. I think this effort should be minimal if calcite-druid is mostly just delegating to druid-sql.
>
> 3) druid-sql depends on calcite-core. This should usually be fine, but it means that if calcite-core has a breaking change, then calcite-druid cannot update its version of druid-sql until druid-sql first updates its version of calcite-core.
>
> Despite these potential difficulties, I think the potential benefit means this is worth exploring.
>
> Finally: a hypothetical. Why not do the other way around -- have Druid add calcite-druid as a dependency? The main reason is that this makes the Druid development process awkward when a new Druid SQL feature also requires a new native query feature. Today, we develop the native query and SQL sides together. If Druid depended on calcite-druid, then we would need to develop the native query side first, then release it, then update Calcite's Druid adapter, then pull that back into Druid. Generally, just adding an extra rule in druid-sql wouldn't be enough, since the sorts of changes we are making at this point are typically more extensive than just adjusting rules.
>
> Gian
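Nishant's second point in this thread (planning separated from execution and result serde) can be pictured with a small Java sketch. Every name in it is hypothetical: `NativeQuery`, `Planner` and `PlanExecutor` are invented for illustration and are not druid-sql or calcite-druid types, the datasource name is made up, and the two JSON fields shown are only a fragment of what a real native query would carry. The idea is simply that the planning side returns plain data, so a caller such as Hive could depend on it without pulling in the execution code.

```java
import java.util.ArrayList;
import java.util.List;

final class PlanExecSplitSketch {

    /** Planning output as plain data. Hypothetical; not druid-sql's real plan type. */
    static final class NativeQuery {
        final String queryType;
        final String dataSource;

        NativeQuery(String queryType, String dataSource) {
            this.queryType = queryType;
            this.dataSource = dataSource;
        }

        /** Serializes the two illustrative fields; a real query would have more. */
        String toJson() {
            return "{\"queryType\":\"" + queryType + "\",\"dataSource\":\"" + dataSource + "\"}";
        }
    }

    /** Would live in a light-weight planning module (assumption). */
    interface Planner {
        NativeQuery plan(String table);
    }

    /** Would live in a separate, heavier execution module (assumption). */
    interface PlanExecutor {
        List<String> run(NativeQuery query);
    }

    public static void main(String[] args) {
        // Planning side: turns a table reference into a native "scan" query.
        Planner planner = table -> new NativeQuery("scan", table);

        // Execution side supplied by the caller; here it only reports what it would send.
        PlanExecutor callerSide = query -> {
            List<String> rows = new ArrayList<>();
            rows.add("would submit " + query.toJson());
            return rows;
        };

        System.out.println(callerSide.run(planner.plan("example_datasource")));
    }
}
```

With a split like this, a broker-side executor and a Hive-side executor could both consume the same planning output, which is the property the Hive use case seems to need.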