Stephan,

I’m pleased and flattered to hear that you are considering Calcite. Flink is an 
excellent project and we would be excited to work with you folks.

Your plan to translate both the Table API and SQL to a Calcite operator tree makes 
sense. For the Table API, you could take a look at RelBuilder: it is a concise 
and user-friendly way of creating operator trees.
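To sketch the fluent style I mean: the real RelBuilder is a Java class (org.apache.calcite.tools.RelBuilder) whose methods such as scan, filter, project and build each push or combine operators on a stack. The toy Python mini-builder below only mimics that shape to show how an operator tree falls out of a chain of calls; it is not a Calcite API.

```python
# Toy Python analogue of Calcite's Java RelBuilder fluent style.
# The real API is org.apache.calcite.tools.RelBuilder, with methods
# such as scan(), filter(), project() and build(). This mini-builder
# is purely illustrative, not part of Calcite.

class MiniRelBuilder:
    def __init__(self):
        self._stack = []

    def scan(self, table):
        self._stack.append(("TableScan", table))
        return self

    def filter(self, condition):
        child = self._stack.pop()
        self._stack.append(("Filter", condition, child))
        return self

    def project(self, *fields):
        child = self._stack.pop()
        self._stack.append(("Project", fields, child))
        return self

    def build(self):
        return self._stack.pop()

tree = (MiniRelBuilder()
        .scan("EMP")
        .filter("DEPTNO = 10")
        .project("ENAME")
        .build())
print(tree)
# ('Project', ('ENAME',), ('Filter', 'DEPTNO = 10', ('TableScan', 'EMP')))
```

The point of the style is that each call consumes the operators built so far, so the Table API could emit one builder call per DSL operation and get a well-formed tree at the end.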

We haven’t done much work on DAGs to date. You can definitely represent a DAG 
as an operator tree. Because Calcite’s optimizer uses dynamic programming, 
even a query as simple as ‘select * from t, t’ will create a DAG, with both 
sides of the join using the same table scan. The problem is that the cost 
will be double-counted: Calcite will cost the plan as if each path through the 
plan were evaluated from scratch. We have some ideas for how to fix that, such 
as a Spool operator [1] that creates temporary tables that, once populated, can 
be used multiple times within the plan. I also think approaches based on 
integer linear programming [2] are promising.
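To make the double-counting concrete, here is a small sketch under an invented cost model (not Calcite's actual costing code): the join's two inputs share one scan object, and a coster that walks the plan as a tree charges that scan once per path, while a DAG-aware coster (the kind a Spool would justify) charges each distinct operator once.

```python
# Sketch of the double-counting problem for 'select * from t, t'.
# A node is (name, cost_of_this_operator, children). The scan object
# is shared by both join inputs, so the plan is a DAG, not a tree.
# The cost numbers are invented for illustration.

scan_t = ("TableScan t", 100, [])
join = ("Join", 10, [scan_t, scan_t])   # both inputs share one scan

def cost_as_tree(node):
    """Tree-style costing: the shared scan is charged once per path."""
    name, c, children = node
    return c + sum(cost_as_tree(ch) for ch in children)

def cost_as_dag(node, seen=None):
    """DAG-aware costing: each distinct operator is charged once,
    as a Spool-style shared result would allow."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    name, c, children = node
    return c + sum(cost_as_dag(ch, seen) for ch in children)

print(cost_as_tree(join))  # 210: the 100-cost scan is counted twice
print(cost_as_dag(join))   # 110: the scan is charged only once
```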

Regarding extending the SQL grammar: the Drill project put in place a mechanism 
that uses FreeMarker as a pre-processor for the grammar source file. A project 
can contribute a patch to Calcite that inserts a macro into Calcite's grammar 
file and defines a default value for that macro; the project's own parser then 
substitutes different values in those places. Phoenix has used the same 
mechanism. Flink could use it to extend Calcite SQL any way it chooses.
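The pre-processing idea can be sketched with Python's string.Template standing in for FreeMarker; the hook name and grammar fragments below are invented for illustration and are not Calcite's actual grammar file.

```python
# FreeMarker-style grammar pre-processing, sketched with Python's
# string.Template. Calcite runs its grammar source through FreeMarker
# so downstream projects can inject their own productions; the macro
# name and fragments here are invented for illustration only.
from string import Template

grammar_template = Template(
    "SqlNode Statement() :\n"
    "{ ... core productions ... }\n"
    "$extra_statement_productions\n"
)

# Calcite's own build substitutes a default (empty) value...
core_grammar = grammar_template.substitute(
    extra_statement_productions="")

# ...while a project such as Flink substitutes its own productions.
flink_grammar = grammar_template.substitute(
    extra_statement_productions="| FlinkWindowStatement()")

print("FlinkWindowStatement" in core_grammar)   # False
print("FlinkWindowStatement" in flink_grammar)  # True
```

The key property is that the core grammar compiles unchanged with the default value, so the extension point costs nothing to projects that do not use it.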

Having said that, we are open to including your extensions in Calcite’s core 
SQL. This is especially true of streaming SQL, where we are trying to support 
at least the common cases (tumbling, hopping, sliding windows). In my opinion, 
the SQL to evaluate a hopping window on streams in Flink, Storm or Samza, or on 
a historical table in Drill or Phoenix, should be the same. Maybe you have 
novel operators that start off as Flink-specific extensions and eventually 
migrate into core SQL.
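For the window kinds above, the semantics we would want the shared SQL to pin down can be sketched as plain window-assignment arithmetic; times are bare seconds here, and this mirrors the usual tumbling/hopping definitions rather than any engine's API.

```python
# Window-assignment sketch for the common cases mentioned above.
# Timestamps are plain seconds; this follows the standard tumbling /
# hopping definitions, not any particular engine's implementation.

def tumbling_window(ts, size):
    """Tumbling windows partition time: each event falls into exactly
    one window [start, start + size)."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """Hopping (sliding) windows of length `size` start every `hop`
    seconds, so one event can fall into several windows."""
    windows = []
    s = (ts // hop) * hop          # latest window start at or before ts
    while s > ts - size:           # window [s, s + size) still contains ts
        windows.append((s, s + size))
        s -= hop
    return windows

print(tumbling_window(7, size=5))            # (5, 10)
print(hopping_windows(7, size=10, hop=5))    # [(5, 15), (0, 10)]
```

Whatever surface syntax Flink, Storm, Samza, Drill or Phoenix expose, agreeing on this underlying assignment is what would make the SQL portable.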

One other thing I’d like to mention: I have been thinking for a while of 
introducing an “engine” SPI into Calcite, an engine being a system that has 
implementations of all core relational operators and a runtime (perhaps 
distributed) to execute them. Currently the only way for a system such as Flink 
to use Calcite is to embed it: Flink starts up first and invokes Calcite for 
parsing and planning. But for some uses you would want to start Calcite first 
(perhaps embedded) and configure it to use Flink for all operators that cannot 
be pushed down to their data source.

I think now would be a good time to introduce the engine SPI, with initial 
implementations for Flink, Drill and perhaps Spark. To be clear, the main way 
for Drill and (I presume) Flink to use Calcite would be by embedding, but the 
engine SPI would give our users more configuration flexibility. It would also 
allow us to better test Calcite-Flink and Calcite-Drill integration before we 
release.
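Since the SPI is only a proposal, here is a hypothetical sketch of the shape I have in mind: an engine declares which core relational operators it can execute, and a trivial assignment step routes each operator either to the data source (pushdown) or to the engine. None of these names exist in Calcite today; the operator sets are invented for illustration.

```python
# Hypothetical sketch of the proposed "engine" SPI. An engine declares
# which core relational operators it can run; the planner pushes what
# the source supports down and routes the rest to the engine. All
# names and operator sets here are invented for illustration.
from abc import ABC, abstractmethod

class Engine(ABC):
    @abstractmethod
    def can_execute(self, operator: str) -> bool: ...

class FlinkEngine(Engine):
    CORE_OPERATORS = {"Scan", "Filter", "Project", "Join",
                      "Aggregate", "Sort"}

    def can_execute(self, operator: str) -> bool:
        return operator in self.CORE_OPERATORS

def assign(plan_ops, source_supports, engine):
    """Place each operator: source pushdown if possible, else engine."""
    placement = {}
    for op in plan_ops:
        if op in source_supports:
            placement[op] = "source"
        elif engine.can_execute(op):
            placement[op] = "engine"
        else:
            raise ValueError(f"no implementation for {op}")
    return placement

print(assign(["Scan", "Filter", "Join"],
             source_supports={"Scan", "Filter"},
             engine=FlinkEngine()))
# {'Scan': 'source', 'Filter': 'source', 'Join': 'engine'}
```

An SPI of roughly this shape is also what would let us run Calcite-Flink and Calcite-Drill integration tests from Calcite's own build.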

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-481

[2] http://www.vldb.org/pvldb/vol7/p217-karanasos.pdf


> On Dec 9, 2015, at 7:02 AM, Stephan Ewen <[email protected]> wrote:
> 
> Hi Calcite Folks!
> 
> The Apache Flink community is currently looking into how to use Calcite for
> optimization of both batch and streaming programs.
> 
> We are looking to compile two different kinds of higher level APIs via
> Calcite to Flink's APIs:
>  - Table API (a LINQ-style DSL)
>  - SQL
> 
> Our current thought is to use largely the same translation paths for both,
> with different entry points into Calcite:
>  - the Table API creates a Calcite operator tree directly
>  - the SQL interface goes through the full stack, including parser, ...
> 
> 
> From what I have seen so far in Calcite, it looks pretty promising, with
> its configurable and extensible rule set, and the pluggable schema/metadata
> providers.
> 
> A few questions remain for us, to see how feasible this is:
> 
> 1) Are DAG programs supported? The table API produces operator DAGs, rather
> than pure trees. Do DAGs affect/limit the space explored by the optimizer
> engine?
> 
> 2) For streaming programs, we will probably want to add some custom syntax,
> specific to Flink's windows. Is it possible to also customize the SQL
> dialect of the parser?
> 
> These answers are quite crucial for us to figure out how to best use Calcite
> in our designs. Thanks for helping us...
> 
> Greetings,
> Stephan
