Thanks for starting this discussion, Jesus. Here are some thoughts, in no particular order.

I too have noticed the increase in academic adoption. This is excellent. Shall we add a section to the "Powered by" page [1] on academic projects and papers?

I worry a lot about audience (or audiences). Who is using Calcite? Are we giving them what they need? Data engines (such as Drill, Hive and Flink) are one category, and I think they are fairly well served. Academics are another audience; some are succeeding, but I wonder whether it would be easier for them if we had some relevant examples, such as how to parse a query and optimize it using several different cost models and combinations of rules. What other audiences are there? There is an audience who would like to use Calcite as a standalone engine, and there are folks who would like to incorporate materialized views, indexes and constraints into their engine but prefer to speak SQL rather than Java APIs. Those groups are not well served today.

I am working on a server which has DDL support [2][3]; it would provide a (simple) standalone engine, but also allow us to demo materialized views, virtual columns, check constraints and foreign tables/schemas via SQL so that people building engines can more easily grasp the concepts.

I read Trummer & Koch's paper "Multi-objective parametric query optimization" [4] in CACM recently. It is a very exciting advance, and too much to cover in this thread, but it got me thinking about how Calcite could evolve to incorporate their ideas. I realize that giving RelOptCost multiple fields was a mistake, unless we also add the mechanics (piecewise-linear cost functions and polytopes) to handle them. The vast majority of Calcite remains applicable, so this would be evolutionary: Calcite's rules and algebra emerge intact in the new order, and Calcite's metadata framework can model the new cost functions.
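
To make the point about multi-field costs concrete, here is a minimal sketch (plain Python, not Calcite code, and not taken from the paper). It assumes a single parameter x (say, a selectivity in [0, 1]) and two cost metrics that are each linear in x. All names (`LinearCost`, `Plan`, `dominance_interval`) and the numbers are hypothetical. With several cost fields there is no total order between plans; the best one can say is on which part of the parameter space one plan is no worse than another on every metric. In one dimension that region is an interval; with more parameters it would be a polytope.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinearCost:
    """One cost metric as a function of the parameter: intercept + slope * x."""
    intercept: float
    slope: float

    def at(self, x: float) -> float:
        return self.intercept + self.slope * x


@dataclass(frozen=True)
class Plan:
    """A hypothetical plan with one LinearCost per metric, e.g. (time, money)."""
    name: str
    costs: Tuple[LinearCost, ...]


def dominance_interval(
    a: Plan, b: Plan, lo: float = 0.0, hi: float = 1.0
) -> Optional[Tuple[float, float]]:
    """Sub-interval of [lo, hi] where `a` is no worse than `b` on every metric.

    Each metric contributes one linear constraint (a half-line); the result
    is their intersection, or None if the intersection is empty.
    """
    for ca, cb in zip(a.costs, b.costs):
        # We need d(x) = ca(x) - cb(x) = di + ds * x <= 0.
        di = ca.intercept - cb.intercept
        ds = ca.slope - cb.slope
        if ds == 0:
            if di > 0:
                return None  # a is worse on this metric everywhere
            continue
        root = -di / ds
        if ds > 0:
            hi = min(hi, root)  # constraint holds for x <= root
        else:
            lo = max(lo, root)  # constraint holds for x >= root
        if lo > hi:
            return None
    return (lo, hi)


if __name__ == "__main__":
    # Made-up numbers: metrics are (time, money), x is a selectivity.
    scan = Plan("full scan", (LinearCost(10.0, 0.0), LinearCost(4.0, 0.0)))
    index = Plan("index lookup", (LinearCost(1.0, 30.0), LinearCost(2.0, 8.0)))
    print(dominance_interval(index, scan))  # index no worse on both metrics
    print(dominance_interval(scan, index))  # scan no worse on both metrics
```

With these made-up numbers the index plan dominates for x up to 0.25 and the scan dominates from 0.3 upward; between those points neither plan dominates, so a planner would have to keep both. A single scalar comparison of multi-field costs cannot express that outcome.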

Extending Calcite could raise some interesting research topics. Is it possible to extend the parameter space (either the number of parameters or the value range of those parameters) after initial planning? Can we use parameters to model whether intermediate results are materialized (see [5]) or whether ephemeral materialized views happen to be present in cache? What new statistics do we need to gather to power the new cost functions? There is enough here to interest several researchers.

As for features:
* I would like to get to full compliance with OpenGIS, because spatial support is much more straightforward in Calcite's algebraic approach than in engines which need to build a new data structure.
* I also would like to give users a choice of engines in Calcite: Spark and perhaps something based on Arrow, in addition to the existing Enumerable engine.
* I would like to continue to make the planner more modular, so that people can supply a program (a collection of rules organized into planning phases) and basically just say "go".
* And I plan to continue my work to make data systems learn and adapt, creating and populating materialized views based on observed query patterns and data statistics.

Regarding governance. I think we are functioning well as a meritocratic community. High-quality contributions arrive from people who have never contributed before; this is happening more and more frequently, which is really excellent. On the other hand, this increases the load for reviewing (and proactively fixing) contributions, and too much of that work still falls on my shoulders. There are times when I get close to burnout, especially when people explicitly direct questions and pull requests at me.

I think Michael would be an excellent PMC chair. I am delighted that he is prepared to do the job.

Regarding CI. There is a bit more CI going on than meets the eye; I run several tests nightly on my home server, and also on a Windows VM, and speak up if things get broken.
But I admit there has been bit-rot in some of the adapters, and having a public CI for those adapters would be useful, if we can do so without generating too much noise.

Julian

[1] https://calcite.apache.org/docs/powered_by.html
[2] https://issues.apache.org/jira/browse/CALCITE-707
[3] https://issues.apache.org/jira/browse/CALCITE-1991
[4] https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/abstract
[5] https://issues.apache.org/jira/browse/CALCITE-481

On Tue, Nov 7, 2017 at 9:19 AM, Josh Elser <[email protected]> wrote:
> On 11/6/17 12:00 PM, Jesus Camacho Rodriguez wrote:
>>
>> I am not involved in the Avatica effort, but it has been great to see
>> Avatica continue maturing, moving into its own repository and following with
>> its own release cadence. Josh, Julian, if you want to add a few lines about
>> the state of Avatica, that would be great.
>
>
> Would be happy to :)
>
> I've certainly been spending less time on core-functionality. Avatica has
> definitely passed the cusp for what most developers need. The majority of
> users would find Avatica to be fully-featured as a JDBC interface (but there
> are some gaps that still exist).
>
> We've started to see the focus on non-JDBC drivers for Avatica which is a
> great sign. Our Francis has been making progress on trying to adopt the
> driver written in Go into the Apache codebase. There are a few other drivers
> available as well. The presence of these drivers, and their ability to
> continue to function is good validation of the protocol/stability model that
> we outlined/implemented in the past 1-2 years.
>
> Avatica is still fairly low-volume, with only a few people contributing. I'd
> love to see more people take an interest (it's a great stepping stone into
> Calcite too ;P).
