Re: [DISCUSS] The state of the project - 2017

Julian Hyde Wed, 08 Nov 2017 13:34:54 -0800

Thanks for starting this discussion, Jesus. Here are some thoughts, in
no particular order.

I too have noticed the increase in academic adoption. This is
excellent. Shall we add a section to the "Powered by" page [1] on
academic projects and papers?

I worry a lot about audience (or audiences). Who is using Calcite? Are
we giving them what they need? Data engines (such as Drill, Hive and
Flink) are one category, and I think they are fairly well served.
Academics are another audience; some are succeeding, but I wonder
whether it would be easier for them if we had some relevant examples,
such as how to parse a query and optimize it using several different
cost models and combinations of rules. What other audiences are there?

There is an audience who would like to use Calcite as a standalone
engine; and folks who would like to incorporate materialized views,
indexes and constraints into their engine but prefer to speak SQL
rather than Java APIs. Those groups are not well served today. I am
working on a server which has DDL support[2][3]; it would provide a
(simple) standalone engine, but also allow us to demo materialized
views, virtual columns, check constraints and foreign tables/schemas
via SQL so that people building engines can more easily grasp the
concepts.

I read Trummer & Koch's paper "Multi-objective parametric query
optimization" [4] in CACM recently. It is a very exciting advance, and
too much to cover in this thread, but it got me thinking about how
Calcite could evolve to incorporate their ideas. I realize that giving
RelOptCost multiple fields was a mistake, unless we also add the
mechanics (piecewise-linear cost functions and polytopes) to handle
them. The vast majority of Calcite remains applicable, so this would
be evolutionary: Calcite's rules and algebra emerge intact in the new
order, and Calcite's metadata framework can model the new cost
functions. Extending Calcite could raise some interesting research
topics; is it possible to extend the parameter space (either the
number of parameters or the value range of those parameters) after
initial planning?; can we use parameters to model whether
intermediate results are materialized (see [5]) or whether ephemeral
materialized views happen to be present in cache?; what new statistics
do we need to gather to power the new cost functions? There is enough
here to interest several researchers.
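
To make the parametric idea concrete, here is a minimal sketch. It is not
Calcite code and not the algorithm from [4]; it assumes just one parameter
p (say, a predicate's selectivity) and plain linear cost functions rather
than piecewise-linear ones. The names LinearCost and best_plan_regions,
and the plan names and cost numbers, are hypothetical. It shows the basic
outcome of parametric optimization: instead of one cheapest plan, you get
a map from parameter regions to plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    """Cost of one plan as a linear function of p: fixed + slope * p."""
    plan: str
    fixed: float
    slope: float

    def at(self, p: float) -> float:
        return self.fixed + self.slope * p


def best_plan_regions(a: LinearCost, b: LinearCost, lo: float, hi: float):
    """Split [lo, hi] into intervals, each labeled with the cheaper plan."""
    if a.slope == b.slope:
        # Parallel lines never cross; the lower fixed cost wins everywhere.
        winner = a if a.fixed <= b.fixed else b
        return [(lo, hi, winner.plan)]
    # Solve a.fixed + a.slope * p == b.fixed + b.slope * p for p.
    cross = (b.fixed - a.fixed) / (a.slope - b.slope)
    if not lo < cross < hi:
        # No crossover inside the range, so one plan wins throughout.
        mid = (lo + hi) / 2
        winner = a if a.at(mid) <= b.at(mid) else b
        return [(lo, hi, winner.plan)]
    left = a if a.at(lo) <= b.at(lo) else b
    right = b if left is a else a
    return [(lo, cross, left.plan), (cross, hi, right.plan)]


# Hypothetical numbers: an index scan is cheap at low selectivity,
# a full scan costs the same regardless of selectivity.
index_scan = LinearCost("index_scan", fixed=10.0, slope=1000.0)
full_scan = LinearCost("full_scan", fixed=100.0, slope=0.0)
regions = best_plan_regions(index_scan, full_scan, 0.0, 1.0)
```

With several parameters the intervals become polytopes, and with
piecewise-linear functions each plan contributes several pieces, which is
the extra machinery a multi-field RelOptCost would need.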

As for features:
* I would like to get to full compliance with OpenGIS, because spatial
support is much more straightforward in Calcite's algebraic approach
than in engines which need to build a new data structure.
* I also would like to give users a choice of engines in Calcite:
Spark and perhaps something based on Arrow, in addition to the
existing Enumerable engine.
* I would like to continue to make the planner more modular, so that
people can supply a program (a collection of rules organized into
planning phases) and basically just say "go".
* And I plan to continue my work to make data systems learn and adapt,
creating and populating materialized views based on observed query
patterns and data statistics.
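
As an illustration of the "program" idea in the third bullet, here is a
toy sketch. It is not Calcite's planner API; the tuple-based plan format,
the rule functions and run_program are all hypothetical, invented for this
example. It shows a program as an ordered list of phases, each phase a
list of rules applied bottom-up until none fires, and a single call that
runs the lot.

```python
def remove_true_filter(node):
    """("filter", "TRUE", x) -> x"""
    if node[0] == "filter" and node[1] == "TRUE":
        return node[2]
    return None


def merge_filters(node):
    """("filter", c1, ("filter", c2, x)) -> ("filter", "(c1) AND (c2)", x)"""
    if node[0] == "filter" and node[2][0] == "filter":
        inner = node[2]
        return ("filter", f"({node[1]}) AND ({inner[1]})", inner[2])
    return None


def to_table_scan(node):
    """Logical ("scan", t) -> physical ("table_scan", t)."""
    if node[0] == "scan":
        return ("table_scan", node[1])
    return None


def rewrite(node, rules):
    """Apply rules bottom-up, repeating until no rule fires on any node."""
    if not isinstance(node, tuple):
        return node  # leaf value such as a condition string or table name
    node = (node[0],) + tuple(rewrite(child, rules) for child in node[1:])
    for rule in rules:
        result = rule(node)
        if result is not None:
            return rewrite(result, rules)
    return node


def run_program(program, plan):
    """Run each phase in order; the output of one phase feeds the next."""
    for _phase_name, rules in program:
        plan = rewrite(plan, rules)
    return plan


program = [
    ("simplify", [remove_true_filter, merge_filters]),
    ("implement", [to_table_scan]),
]
plan = ("filter", "TRUE",
        ("filter", "a > 1",
         ("filter", "b < 2", ("scan", "emp"))))
optimized = run_program(program, plan)
```

The caller only assembles the list of phases and says "go"; swapping in a
different combination of rules means editing the list, not the driver.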

Regarding governance. I think we are functioning well as a
meritocratic community. High-quality contributions arrive from people
who have never contributed before; this is happening more and more
frequently, which is really excellent. On the other hand, this
increases the load for reviewing (and pro-actively fixing)
contributions, and too much of that work still falls on my shoulders.
There are times when I get close to burnout, especially when people
explicitly direct questions and pull requests at me.

I think Michael would be an excellent PMC chair. I am delighted that
he is prepared to do the job.

Regarding CI. There is a bit more CI going on than meets the eye; I
run several tests nightly on my home server, and also on a Windows VM,
and speak up if things get broken. But I admit there has been bit-rot
in some of the adapters, and having a public CI for those adapters
would be useful, if we can do so without generating too much noise.

Julian

[1] https://calcite.apache.org/docs/powered_by.html

[2] https://issues.apache.org/jira/browse/CALCITE-707

[3] https://issues.apache.org/jira/browse/CALCITE-1991

[4] https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/abstract

[5] https://issues.apache.org/jira/browse/CALCITE-481

On Tue, Nov 7, 2017 at 9:19 AM, Josh Elser <[email protected]> wrote:
> On 11/6/17 12:00 PM, Jesus Camacho Rodriguez wrote:
>>
>> I am not involved in the Avatica effort, but it has been great to see
>> Avatica continue maturing, moving into its own repository and following with
>> its own release cadence. Josh, Julian, if you want to add a few lines about
>> the state of Avatica, that would be great.
>
>
> Would be happy to :)
>
> I've certainly been spending less time on core-functionality. Avatica has
> definitely passed the cusp for what most developers need. The majority of
> users would find Avatica to be fully-featured as a JDBC interface (but there
> are some gaps that still exist).
>
> We've started to see the focus on non-JDBC drivers for Avatica which is a
> great sign. Our Francis has been making progress on trying to adopt the
> driver written in Go into the Apache codebase. There are a few other drivers
> available as well. The presence of these drivers, and their ability to
> continue to function is good validation of the protocol/stability model that
> we outlined/implemented in the past 1-2 years.
>
> Avatica is still fairly low-volume, with only a few people contributing. I'd
> love to see more people take an interest (it's a great stepping stone into
> Calcite too ;P).
