Hi everyone,

This is a really great discussion. Thanks for starting it, Martijn, and for your input, Jacques! I have been fighting against forking Calcite in Flink for years already. Even when merging forks of Flink that transitively forked Calcite, in the end we were able to resolve conflicts / contribute blockers back into Calcite. And I strongly believe that this is the better approach for long-term success for both projects.

I would like to get more involved in the Calcite community. I have been implementing and managing Flink SQL based on Calcite since 2016. Thus, I feel confident saying that I know the code base and some quirks in the stack very well.

Capacity-wise I will try to reserve some time for helping the Calcite community. Happy to get some pointers where and how I can help.

I will take a look at https://github.com/apache/calcite/pull/2606 this week to get the ball rolling, as this is an important addition and prepares for "custom SQL operators" in Flink SQL.

Regards,
Timo

On 21.06.22 22:18, Charles Givre wrote:
As the PMC for Apache Drill, I'd echo everyone's comments here.... Don't fork.  
 Don't do it.

Apache Drill forked Calcite several years ago, when Calcite was on version 1.20 or 1.21.
While this meant that some bugs were easily fixed, it also meant that as our fork
diverged from "regular" Calcite, it became harder and harder to maintain.  It
also meant that we were chasing bugs that had since been fixed.

Drill is in the process of "de-forking" Calcite, meaning that we're ditching 
our fork and re-integrating with standard Calcite.  It has been A TON of work and we have 
contributed (and will continue to contribute) bug fixes and PRs to Calcite. In the long 
run, I think this will be beneficial for both communities.

Best,
-- C


On Jun 21, 2022, at 1:57 PM, Julian Hyde <jhyde.apa...@gmail.com> wrote:

Please don’t fork Calcite.

Calcite suffers from the tragedy of the commons. Unlike many open source data 
projects, there is no commercial project that directly maps to Calcite (even 
though Calcite is an essential part of many projects). As a result no engineers 
work full-time on Calcite.

It takes more than pull requests to keep a project going. We need reviewers, 
people to work on releases, people to fix bugs (such as security bugs) that are 
important to everyone but urgent to no one.

We have plenty of committers in Calcite, and add several more per year. We rely 
on those committers taking on their share of the housework, but the burden 
falls on too few people.

Engineering managers need to start paying a little more for the “free lunch” 
that they enjoy when Calcite “just works” in their project. Sadly, most 
engineering managers are not subscribed to this list.

Julian


On Jun 21, 2022, at 9:49 AM, Jacques Nadeau <jacq...@apache.org> wrote:

Martijn, thanks for sharing that thread in the Flink community.

I'm someone who has forked Calcite twice: once in Apache Drill and again in
Dremio. In both cases, it was all about trading short term benefits against
long term costs. In both cases, I think the net amount of work was probably
5x as much as what it would have been if we had just done a better job
engaging the community. If I were to sketch the effort curve over six
years, I'd guess that in both cases the numbers looked like this:

estimated effort doing high intensity integration with calcite (years 1-6)
fork: 1, 5, 10, 50, 100, 200, total = 366
non-fork: 10, 10, 10, 10, 10, 10, total = 60

So yes, the first couple years you're ahead. But you pay a massive
technical debt premium long term. Early in a project (Drill) or company's
life (Dremio), it can make sense to sacrifice long term for short term but
it's important people do it with their eyes open.
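
As a rough illustration of the numbers above (a sketch only, assuming the
non-fork path costs a flat 10 units in each of the six years; the variable
names are made up for this example), the cumulative totals can be tallied
to see when the fork falls behind:

```python
from itertools import accumulate

# Per-year effort estimates for the fork path, taken from the message above
# (arbitrary units, years 1-6).
fork = [1, 5, 10, 50, 100, 200]
# Assumption: the non-fork path costs a flat 10 units in every one of six years.
non_fork = [10] * 6

# Running totals of effort spent by the end of each year.
fork_total = list(accumulate(fork))
non_fork_total = list(accumulate(non_fork))

# First year (1-based) in which the fork's cumulative effort exceeds
# the cumulative effort of staying on upstream Calcite.
crossover_year = next(
    (
        year
        for year, (f, n) in enumerate(zip(fork_total, non_fork_total), start=1)
        if f > n
    ),
    None,
)

for year, (f, n) in enumerate(zip(fork_total, non_fork_total), start=1):
    print(f"year {year}: fork={f:>3}  non-fork={n:>3}")
print("fork falls behind in year", crossover_year)
```

Under these assumed numbers the fork is cheaper cumulatively through year 3
and falls behind in year 4.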

The reason that this pain is so high is that as your codebases diverge, you
start having to do everything the Calcite community does by yourself.
Backports become harder and things that you need (e.g. new sql syntax, etc)
have to be reimplemented (even if someone else already implemented them in
some post-fork Calcite version). Ultimately, at some point you realize that
your path is untenable and you unfork. This becomes the biggest expense of
them all and I believe both of those teams are still trying to un-fork. The
additional thing that becomes an even bigger problem is your absence from
the Calcite community means that people may take the project or APIs in
ways that are in direct conflict with how you use the library. Since you're
not active in the project, you fail to provide a counterpoint and then
you're basically just in a miserable place. The Hive project did this best
by ensuring that releases of Calcite were also run pre-release against Hive
to make sure no major regressions occurred. Being in the community and
active is the best state from my pov. (It makes your project better
and Calcite better.)

Two last notes:
- I'm not sure the rocks fork is comparable to forking Calcite. The api
surface area and community models are very different.
- This is all based on a high intensity integration (using rules + planner
or sql + rules + planner). Calcite is frustratingly monolithic and if
someone was only going to use a small component, my opinion would likely be
very different.

I'd send this to the Flink list but I'm not subscribed. It'd be great if
you shared it with the people over there if you think they'd find it useful.



On Tue, Jun 21, 2022 at 12:31 AM Martijn Visser <martijnvis...@apache.org>
wrote:

Thanks Julian and Austin!

Any reply to kick off some sort of discussion is worthwhile :D
I definitely know the feeling of having more PRs open than you would like,
looking at https://github.com/apache/flink/pulls :)

There have been discussions in the Flink community about forking Calcite
[1]. My personal preference at the moment is to see if we can create a
better collaboration and community. I believe that we can find people from
the Flink community who can open / help review Calcite PRs that are
interesting for the Flink community. The question is if that will also help
short term since in the end it still requires a Calcite maintainer to
review/merge.

Best regards,

Martijn

[1] https://lists.apache.org/thread/1oqydpsm4mc55bkk440gx9lr9gf2rvf4


On Mon, Jun 20, 2022 at 11:51 PM Austin Bennett <whatwouldausti...@gmail.com>
wrote:

 From the peanut gallery :-)  -->

Wow; yes, lots of open PRs.  https://github.com/apache/calcite/pulls

How can individuals from the Flink [sub-]community, and/or more general
calcite community help lighten this load?  Is there much weight given to
reviews from non-committers; how to increase the # of people capable of
providing worthwhile reviews [ that are recognized as such ]?



On Mon, Jun 20, 2022 at 11:47 AM Julian Hyde <jhyde.apa...@gmail.com>
wrote:

Martijn,

Since you requested a reply, I am replying. To answer your question, I
don’t know of a way to move this topic forward. We have more PRs than
people to review them.

Julian


On Jun 19, 2022, at 11:58 PM, Martijn Visser <martijnvis...@apache.org>
wrote:

Hi everyone,

I just wanted to reach out to the Calcite community once more on this topic
since no reply was received. Would be great if someone could get back to us.

Best regards,

Martijn

On Wed, Jun 8, 2022 at 11:24 AM Martijn Visser <martijnvis...@apache.org>
wrote:

Hi everyone,

I would like to follow up on this email that was sent by Jing. So far, no
progress has been made, despite reaching out to the mailing list, the
original Jira ticket and reaching out to people directly. Is there a way
that we can move this PR/topic forward?

For context, in Apache Flink we're currently heavily using Calcite.
However, we are now at the stage where Calcite is actually holding us back.
It would be great if we can find a way to strengthen our bond and move both
Calcite and Flink forward.

Looking forward to your thoughts,

Martijn

On 2022/01/26 07:05:37 Jing Zhang wrote:
Hi community,
My apologies for interrupting.
Could anyone help review the PR
https://github.com/apache/calcite/pull/2606?
Thanks a lot.

CALCITE-4865 is the first sub-task of CALCITE-4864. This Jira issue aims to
extend the existing table function in order to support Polymorphic Table
Functions, which were introduced as part of ANSI SQL 2016.

The brief change log of the PR is:
- Update `Parser.jj` to support the partition by clause and order by clause
for an input table with set semantics of a PTF
- Introduce `TableCharacteristics`, which contains three characteristics
of an input table of a table function
- Update `SqlTableFunction` to add a method `tableCharacteristics`; the
method returns the table characteristics for the ordinal-th argument to
this table function. The default return value is Optional.empty, which means
the ordinal-th argument is not a table.
- Introduce `SqlSetSemanticsTable`, which represents an input table with set
semantics of a table function; its `SqlKind` is `SET_SEMANTICS_TABLE`
- Update `SqlValidatorImpl` to validate that only a set semantics table of a
table function can have partition by and order by clauses
- Update `SqlToRelConverter#substituteSubQuery` to parse a subQuery which
represents a set semantics table.

PR: https://github.com/apache/calcite/pull/2606
JIRA: https://issues.apache.org/jira/browse/CALCITE-4865
Parent JIRA: https://issues.apache.org/jira/browse/CALCITE-4864

Best,
Jing Zhang








