The FreeMarker pre-processing is done using a Maven plugin and a few FreeMarker macros. Drill pioneered this approach and Phoenix was able to apply the same code. If you follow what I did for PHOENIX-1706 [1] [2] it should be straightforward. Note that PhoenixPrepare overrides the method that creates the parser factory.
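Roughly, the wiring on the embedding side looks like the sketch below; "FlinkSqlParserImpl" stands in for whatever class your FMPP/JavaCC grammar template generates (the name is hypothetical, but generated parsers expose a static FACTORY field):

  import org.apache.calcite.sql.SqlNode;
  import org.apache.calcite.sql.parser.SqlParser;
  // import ...FlinkSqlParserImpl;  // hypothetical: the parser class
  //                                // generated by FMPP + JavaCC

  public class ExtendedParserExample {
    public static SqlNode parse(String sql) throws Exception {
      // Point Calcite's parser config at the generated parser's factory.
      SqlParser.Config config = SqlParser.configBuilder()
          .setParserFactory(FlinkSqlParserImpl.FACTORY)
          .build();
      return SqlParser.create(sql, config).parseStmt();
    }
  }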
Julian

[1] https://issues.apache.org/jira/browse/PHOENIX-1706
[2] https://github.com/julianhyde/phoenix/commit/1c0a989ecbd9448709e5682d78a7dab3a9332547

On Fri, Dec 11, 2015 at 9:20 AM, Stephan Ewen <[email protected]> wrote:

> Hi Julian!
>
> Thanks for the explanation. We'll definitely go with embedding Calcite for now.
>
> As long as DAGs are supported it is fine if the costs are double counted. In most cases, that probably affects only the absolute cost estimate, and the relative ranking among plans is still good.
>
> We'll look into FreeMarker for grammar pre-processing and come back when we need help.
>
> Greetings,
> Stephan
>
>
> On Thu, Dec 10, 2015 at 10:29 AM, Julian Hyde <[email protected]> wrote:
>
>> Stephan,
>>
>> I’m pleased and flattered to hear that you are considering Calcite. Flink is an excellent project and we would be excited to work with you folks.
>>
>> Your plan to translate both the Table API and SQL to a Calcite operator tree makes sense. For the Table API, you could take a look at RelBuilder: it is a concise and user-friendly way of creating operator trees.
>>
>> We haven’t done much work on DAGs to date. You can definitely represent a DAG as an operator tree — because Calcite’s optimizer uses dynamic programming, even a query as simple as ‘select * from t, t’ will create a DAG, with both sides of the join using the same table scan — but the problem is that the cost will be double-counted: Calcite will cost the plan as if each path through the plan were evaluated from scratch. We have some ideas for how to fix that, such as a Spool operator [1] that creates temporary tables that, once populated, can be used multiple times within the plan. I also think approaches based on integer linear programming [2] are promising.
>>
>> Regarding extending the SQL grammar: the Drill project put in place a mechanism that uses FreeMarker as a pre-processor for the grammar source file. A project can contribute a patch to Calcite that inserts a macro into Calcite's grammar file and defines a default value for that macro; its own parser would substitute different values in those places. Phoenix has used the same mechanism. Flink could use this same mechanism to extend Calcite SQL any way it chooses.
>>
>> Having said that, we are open to including your extensions in Calcite’s core SQL. This is especially true of streaming SQL, where we are trying to support at least the common cases (tumbling, hopping, sliding windows). In my opinion, the SQL to evaluate a hopping window on streams in Flink, Storm or Samza, or on a historical table in Drill or Phoenix, should be the same. Maybe you have novel operators that start off as Flink-specific extensions and eventually migrate into core SQL.
>>
>> One other thing I’d like to mention. I have been thinking for a while of introducing an “engine” SPI into Calcite, an engine being a system that has implementations of all core relational operators and a runtime (perhaps distributed) to execute them. Currently the only way for a system such as Flink to use Calcite is to embed it: Flink starts up first, and invokes Calcite for parsing and planning. But for some uses, you want to start Calcite first (perhaps embedded), with Calcite configured to use Flink for all operators that cannot be pushed down to their data source.
>>
>> I think now would be a good time to introduce the engine SPI, with initial implementations for Flink, Drill and perhaps Spark.
>> To be clear, the main way for Drill and (I presume) Flink to use Calcite would be by embedding, but the engine SPI would give our users more configuration flexibility. It would also allow us to better test Calcite-Flink and Calcite-Drill integration before we release.
>>
>> Julian
>>
>> [1] https://issues.apache.org/jira/browse/CALCITE-481
>> [2] http://www.vldb.org/pvldb/vol7/p217-karanasos.pdf
>>
>>
>> > On Dec 9, 2015, at 7:02 AM, Stephan Ewen <[email protected]> wrote:
>> >
>> > Hi Calcite Folks!
>> >
>> > The Apache Flink community is currently looking into how to use Calcite for optimization of both batch and streaming programs.
>> >
>> > We are looking to compile two different kinds of higher-level APIs via Calcite to Flink's APIs:
>> >   - Table API (a LINQ-style DSL)
>> >   - SQL
>> >
>> > Our current thought is to use largely the same translation paths for both, with different entry points into Calcite:
>> >   - the Table API creates a Calcite operator tree directly
>> >   - the SQL interface goes through the full stack, including parser, ...
>> >
>> > From what I have seen so far in Calcite, it looks pretty promising, with its configurable and extensible rule set, and the pluggable schema/metadata providers.
>> >
>> > A few questions remain for us, to see how feasible this is:
>> >
>> > 1) Are DAG programs supported? The Table API produces operator DAGs, rather than pure trees. Do DAGs affect/limit the space explored by the optimizer engine?
>> >
>> > 2) For streaming programs, we will probably want to add some custom syntax, specific to Flink's windows. Is it possible to also customize the SQL dialect of the parser?
>> >
>> > These answers are quite crucial for us to figure out how to best use Calcite in our designs. Thanks for helping us...
>> >
>> > Greetings,
>> > Stephan
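To make the RelBuilder suggestion and the 'select * from t, t' DAG discussed above concrete, here is a minimal sketch. It assumes the default schema contains a table "T"; the empty root schema used here is only a placeholder, so a real application would register its own schema.

  import org.apache.calcite.plan.RelOptUtil;
  import org.apache.calcite.rel.RelNode;
  import org.apache.calcite.rel.core.JoinRelType;
  import org.apache.calcite.tools.FrameworkConfig;
  import org.apache.calcite.tools.Frameworks;
  import org.apache.calcite.tools.RelBuilder;

  public class DagExample {
    public static void main(String[] args) {
      // Placeholder schema; a real application supplies a schema that
      // actually contains the table "T".
      FrameworkConfig config = Frameworks.newConfigBuilder()
          .defaultSchema(Frameworks.createRootSchema(true))
          .build();
      RelBuilder builder = RelBuilder.create(config);

      // 'select * from t, t': two scans of the same table joined with a
      // TRUE condition (a cartesian product). Both scans have the same
      // digest, so the planner's memo treats them as a single node; the
      // plan is a DAG, and, as noted above, its cost is double-counted.
      RelNode plan = builder
          .scan("T")
          .scan("T")
          .join(JoinRelType.INNER, builder.literal(true))
          .build();

      System.out.println(RelOptUtil.toString(plan));
    }
  }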
