FYI - Druid devs are proposing to embed Calcite in Druid, so that Druid has a native SQL interface.
I think Gian’s plan makes sense; its architecture is similar in a lot of ways to how we are integrating with Phoenix. Calcite’s Druid adapter will still exist, still be useful, and in fact I expect that Druid devs will end up building on it. I replied on the thread: https://groups.google.com/forum/?pli=1#!topic/druid-development/3npt9Qxpjr0

Julian

> Begin forwarded message:
>
> From: Gian Merlino <[email protected]>
> Subject: [druid-dev] [Proposal] Built-in SQL for Druid
> Date: October 12, 2016 at 9:50:48 AM PDT
> To: [email protected]
> Reply-To: [email protected]
>
> Inspired by the Calcite Druid adapter
> (https://groups.google.com/d/topic/druid-development/FK5D162ao74/discussion),
> I've been playing around with something similar that lives inside of the
> Druid Broker. It seems promising, so in this proposal I'm suggesting we
> include an official SQL server inside Druid itself.
>
> I am hoping that we can:
>
> 1) Use Calcite for SQL parsing and optimizing, and use Avatica
> (https://calcite.apache.org/docs/avatica_overview.html) for the server and
> the JDBC client.
> 2) Like the official Calcite Druid adapter, have a set of rules that push
> down filters, projections, aggregations, sorts, etc. into normal Druid queries.
> 3) Unlike the official Calcite Druid adapter, use Druid objects (like
> DimFilter, ExtractionFn, etc.) as model classes, since that avoids extra code,
> helps with type safety, and speeds up development.
> 4) Have this all run on the Broker, which would then make normal Druid
> queries to data nodes.
> 5) Work towards being able to push down more and more SQL into normal Druid
> queries over time.
>
> Current status
>
> If people are interested in this proposal then I'll clean up the code a bit
> and do a PR.
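[Forwarder's aside: the rule-based pushdown in step 2 can be illustrated with a toy sketch. This is a minimal Python simulation, not the actual Java planner code; the `translate` helper, the plan shape, and the sample datasource/columns are invented for illustration, while the output dict follows Druid's documented groupBy query JSON.]

```python
# Hypothetical sketch: translate a tiny logical plan (filter -> aggregate)
# into a native Druid groupBy query. The translate() helper and its inputs
# are invented for illustration; the output shape follows Druid's
# documented groupBy query JSON.

def translate(datasource, interval, filter_dim, filter_value, group_dim, sum_metric):
    """Build a native Druid groupBy query roughly equivalent to:
    SELECT group_dim, SUM(sum_metric) FROM datasource
    WHERE filter_dim = filter_value GROUP BY group_dim
    """
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "filter": {"type": "selector", "dimension": filter_dim, "value": filter_value},
        "dimensions": [group_dim],
        "aggregations": [{"type": "doubleSum", "name": "sum_" + sum_metric,
                          "fieldName": sum_metric}],
    }

query = translate("wikipedia", "2016-01-01/2016-02-01",
                  "countryName", "United States", "page", "added")
print(query["queryType"])  # groupBy
```

The point of the planner rules is exactly this kind of rewrite: recognize a relational-operator pattern Calcite produced and emit the equivalent native query so the data nodes, not the Broker, do the work.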
> Currently it's a rough prototype. Some things that do work:
>
> 1) Avatica handler running at /druid/v2/sql/ + Avatica JDBC driver
> 2) Determining column types with segment metadata queries
> 3) Pushing down operator sequences that look like filter -> project ->
> aggregate -> project -> sort into groupBy, timeseries, and select queries as
> appropriate
> 4) Using "intervals" to filter on time when appropriate
> 5) LIKE, range, equality, and boolean filters
> 6) SUM, MIN, MAX, AVG, COUNT, COUNT DISTINCT
> 7) Some extraction fns like SUBSTRING, CHAR_LENGTH
> 8) GROUP BY FLOOR(__time TO gran) for time series
> 9) Arithmetic post-aggregations
> 10) Filtered aggregations using CASE or using FILTER(WHERE ...)
> 11) Semi-joins like SELECT ... WHERE xxx IN (SELECT ...) can run by
> materializing the inner result on the Broker and applying it to the outer
> query as a filter. Obviously this doesn't always work, but it works sometimes
> (and it works more often than pulling the left-hand side into the Broker…).
>
> Non-exhaustive list of things that don't work:
>
> 1) Pushing down filter after aggregate (HAVING)
> 2) Pushdown of anything without a native Druid analog, like multi-column
> extraction fns, aggregation of expressions, window functions, etc.
> 3) Any extraction fns other than SUBSTRING, CHAR_LENGTH
> 4) A lot of time stuff, like x + INTERVAL, FLOOR(__time TO MONTH) = x, etc.
> 5) Query-time lookups
> 6) SELECT with pagination – only the first 1000 results are used
> 7) Any sort of memory usage controls on the Broker side
>
> FAQ
>
> 1) Why another SQL-on-Druid thing? There's already, like, 7 of them.
>
> I think the fact that there are 7 of them means there's clearly some value in
> having a built-in implementation. Partially this is so we can hopefully share
> some work between the projects.
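[Forwarder's aside: the semi-join strategy in item 11 of the "things that work" list can be sketched as a toy Python simulation over in-memory rows. The `semi_join` helper and the sample data are invented for illustration; the real implementation rewrites the outer native Druid query with an "in"-style filter.]

```python
# Toy simulation of item 11: SELECT ... WHERE dim IN (SELECT ...) executed by
# materializing the inner result on the Broker and applying it to the outer
# query as a filter. The tables and helper are invented for illustration.

outer_rows = [
    {"user": "alice", "edits": 10},
    {"user": "bob", "edits": 3},
    {"user": "carol", "edits": 7},
]
inner_rows = [{"user": "alice"}, {"user": "carol"}]

def semi_join(outer, inner, key):
    # Step 1: run the inner query and materialize its result on the Broker.
    in_set = {row[key] for row in inner}
    # Step 2: rewrite the outer query with an "in"-style filter and run it.
    return [row for row in outer if row[key] in in_set]

result = semi_join(outer_rows, inner_rows, "user")
print([r["user"] for r in result])  # ['alice', 'carol']
```

This only works when the inner result is small enough to hold (and ship) as a filter, which is why the email hedges that it "works sometimes".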
> Partially this is because Druid doesn't
> support some things that are needed for well-rounded SQL support (like
> multi-column extraction fns, aggregations of expressions, etc.), and having the
> SQL layer inside the Druid repo will make it possible to develop those sorts
> of features hand in hand with the SQL planner rules.
>
> Btw, the 7 that I counted are, in alphabetical order: Calcite
> (https://calcite.apache.org/docs/druid_adapter.html), Drill
> (https://groups.google.com/d/msg/druid-development/FK5D162ao74/EnYDjASWCQAJ),
> Druid's own simple grammar (added in 2013, removed in
> https://github.com/druid-io/druid/pull/2090), Hive
> (https://cwiki.apache.org/confluence/display/Hive/Druid+Integration), PlyQL
> (http://plywood.imply.io/plyql), Sparkline
> (https://github.com/SparklineData/spark-druid-olap), and Sql4D
> (https://github.com/srikalyc/Sql4D).
>
> 2) Is the proposed SQL language actually SQL or is it "SQL-like"?
>
> In terms of what can be efficiently pushed down to Druid queries, it's
> "SQL-like". A lot of common SQL features aren't supported – although I think
> it makes sense to add more over time. Technically Calcite does speak full
> SQL, but at the start a lot of it would get planned as pulling all the raw
> data into the Broker and processing it in Calcite's interpreter.
>
> 3) Why not use the Druid adapter in Calcite?
>
> Calcite's Druid adapter doesn't depend on any Druid jars; it implements the
> query language and protocol using its own set of model and client classes.
> For a built-in approach I wanted to be able to use Druid's own Query,
> DimFilter, ExtractionFn, etc. in a type-safe way, and wanted to use the query
> code that already exists in Druid for discovering and querying data nodes. I
> think this will also help speed up development of Druid features that allow
> more SQL to be pushed down.
>
> 4) Can we share work between a built-in Druid SQL and the other SQL-on-Druid
> adapters that people are working on?
>
> Hopefully! I think it would make sense if a built-in Druid SQL could be used
> for whatever Druid supports natively, and external SQL-on-Druid adapters
> could be used when users want to do something that Druid doesn't support.
> Sharing the work needed to translate "whatever Druid supports natively" into
> Druid queries would help everyone.
>
> Hive and Drill already use Calcite internally, and I hope it's workable to
> stuff Druid's own rules into their planners without changing too much. If
> those projects are comfortable embedding druid-server then that should work
> straight away. If they aren't comfortable embedding druid-server (perhaps
> understandably) then we could bite the bullet and work on a lighter-weight
> druid-client jar that has just enough to give us the benefit of type
> checking, and does not include all the heavy Druid functionality.
>
> If you're working on one of those projects, feedback is greatly appreciated.
>
> 5) What happens when parts of the SQL query can't be converted to a native
> Druid query?
>
> Calcite is rad and runs the parts that can't be pushed down through an
> interpreter on the Druid Broker. Of course this means that if you use
> constructs that are close to the data and can't be pushed down, like grouping
> on CONCAT(foo, bar) or aggregating SUM(3 * bar), a surprisingly large amount
> of data can be pulled out into the Broker. This is not great behavior and
> something should be done about it…
>
> 6) What about JOINs?
> I don't know; maybe it makes sense for Druid to have query types usable for
> joins in the future. But it doesn't now; the closest thing is query-time
> lookups, which are like a broadcast join. Without native join support in
> Druid, it makes more sense to pull data out of Druid into another system
> (like Drill or Hive or Spark) and do the join there. Even if Druid did
> support native joins, there's still some value in using an external execution
> engine to join Druid data with data from some other system. Filters and
> aggregations can still potentially be pushed down, depending on the query.
>
> 7) JDBC works, but what about ODBC?
>
> Avatica's home page says work on an ODBC client has not yet started. The page
> at https://hortonworks.com/hadoop-tutorial/bi-apache-phoenix-odbc/ is
> interesting, since the Phoenix Query Server is also Avatica-based, so maybe
> that work could be useful? However, it doesn't seem to be open source, and
> when I tried to get the binary to work, the Windows ODBC setup tool crashed
> after calling getTableTypes. Maybe someone at Hortonworks can comment :)
>
> Gian
>
> --
> You received this message because you are subscribed to the Google Groups
> "Druid Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/druid-development/CACZNdYAVv8WLP1Qw4dzzf60e-3CwzP_%3DFmWeOd_2OvPsfcw8Ag%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.
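[Forwarder's aside: the "broadcast join" analogy for query-time lookups in Q6 can be illustrated with a toy Python sketch. A small key-to-value table is "broadcast" to every data node and applied to each fact row while scanning; all names and data here are invented, and the fallback-to-original-key behavior is just one of the ways a lookup can treat missing keys.]

```python
# Toy illustration of why query-time lookups resemble a broadcast join:
# a small key -> value table is shipped ("broadcast") to every data node
# and applied to each fact row during scanning. Data here is invented.

lookup = {"US": "United States", "FR": "France"}  # small broadcast table

fact_rows = [
    {"country_code": "US", "edits": 5},
    {"country_code": "FR", "edits": 2},
    {"country_code": "DE", "edits": 1},
]

def apply_lookup(rows, lookup, src, dst):
    # Each data node maps the dimension through the lookup while scanning;
    # here unmatched keys fall back to the original value.
    return [dict(row, **{dst: lookup.get(row[src], row[src])}) for row in rows]

joined = apply_lookup(fact_rows, lookup, "country_code", "country_name")
print(joined[0]["country_name"])  # United States
```

As with any broadcast join, this only scales while the lookup table stays small enough to replicate everywhere, which is why general joins are better left to an external engine for now.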
