+1 for Calcite

Best,
Vino

David Sheard <[email protected]> wrote on Thu, Dec 17, 2020 at 2:15 PM:

> I agree with Calcite
>
> On Thu, 17 Dec 2020 at 5:04 pm, Danny Chan <[email protected]> wrote:
>
> > Apache Calcite is a good candidate for parsing and executing the SQL;
> > Apache Flink has a SQL extension built on the Calcite parser [1].
> >
> > > users will write: hudiSparkSession.sql("UPDATE ....")
> >
> > Would users still need to instantiate the hudiSparkSession first? My
> > desired use case is that users run these SQLs through the Hoodie CLI;
> > they could choose which engine to use via a CLI config option, as
> > sketched below.
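> >
> > To make that concrete, the interaction might look something like this
> > (the sql command and the --engine option here are hypothetical, not
> > existing Hoodie CLI commands):
> >
> >   hudi-cli> sql --engine spark "UPDATE hudi_table SET ts = 0 WHERE id = 1"
> >   hudi-cli> sql --engine flink "MERGE INTO hudi_table ..."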
> >
> > > If we want those expressed in Calcite as well, we need to also invest in
> > > the full Query side support, which can increase the scope by a lot.
> >
> > That is true; my thought is that we use Calcite to execute only these
> > MERGE SQL statements. For DQL or other DML, we would delegate the
> > parsing/execution to the underlying engines (Flink or Spark): the Hoodie
> > Calcite parser would only parse those statements and hand them over to
> > the engines. One thing to note is the SQL dialect difference: Spark may
> > have its own syntax (keywords) that Calcite cannot parse/recognize.
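> >
> > A minimal sketch of that dispatch, using Calcite's stock parser (the
> > HoodieSqlDispatcher class and its method names are hypothetical, not
> > existing Hudi code):
> >
> >   import org.apache.calcite.sql.SqlNode;
> >   import org.apache.calcite.sql.parser.SqlParseException;
> >   import org.apache.calcite.sql.parser.SqlParser;
> >
> >   public class HoodieSqlDispatcher {
> >     public void execute(String sql) throws SqlParseException {
> >       // Parse once with Calcite, then route by statement kind.
> >       SqlNode node = SqlParser.create(sql).parseStmt();
> >       switch (node.getKind()) {
> >         case MERGE:
> >         case UPDATE:
> >         case DELETE:
> >           runHudiMutation(node);  // translate to a Hudi upsert/delete
> >           break;
> >         default:
> >           delegateToEngine(sql);  // e.g. sparkSession.sql(sql)
> >       }
> >     }
> >     private void runHudiMutation(SqlNode node) { /* plan the Hudi write */ }
> >     private void delegateToEngine(String sql) { /* pass through as-is */ }
> >   }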
> >
> > [1] https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
> >
> > Vinoth Chandar <[email protected]> wrote on Fri, Dec 11, 2020 at 3:58 PM:
> >
> > > Hello all,
> > >
> > > One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> > > syntax to support writing into Hudi tables. We have looked into the
> > > Spark 3 DataSource V2 APIs as well and found several issues that hinder
> > > us in implementing this via the Spark APIs:
> > >
> > > - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> > > external datasources like Hudi; only DELETE is.
> > > - The DataSource V2 API offers no flexibility to perform any further
> > > transformations on the dataframe. Hudi supports keys, indexes,
> > > preCombining, and custom partitioning that ensures file sizes, etc. All
> > > of this needs shuffling data and looking up/joining against other
> > > dataframes. Today, the DataSource V1 API allows this kind of further
> > > repartitioning/transformation, but the V2 API simply offers
> > > partition-level iteration once the user calls df.write.format("hudi")
> > > (see the sketch below).
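> > >
> > > To illustrate the partition-level limitation, here is a minimal sketch
> > > of a Spark 3 DSv2 writer (assuming the Spark 3.0 connector interfaces;
> > > HudiPartitionWriter is a hypothetical name, not actual Hudi code):
> > >
> > >   import java.io.IOException;
> > >   import org.apache.spark.sql.catalyst.InternalRow;
> > >   import org.apache.spark.sql.connector.write.DataWriter;
> > >   import org.apache.spark.sql.connector.write.WriterCommitMessage;
> > >
> > >   class HudiPartitionWriter implements DataWriter<InternalRow> {
> > >     @Override
> > >     public void write(InternalRow record) throws IOException {
> > >       // Only this partition's records arrive here, one at a time; there
> > >       // is no hook to shuffle, join against an index, or precombine.
> > >     }
> > >     @Override
> > >     public WriterCommitMessage commit() throws IOException { return null; }
> > >     @Override
> > >     public void abort() throws IOException { }
> > >     @Override
> > >     public void close() throws IOException { }
> > >   }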
> > >
> > > One thought I had is to explore Apache Calcite and write an adapter for
> > > Hudi. This frees us from being very dependent on a particular engine's
> > > syntax support, like Spark's. Calcite is very popular by itself and
> > > supports most of the keywords (and also a more streaming-friendly
> > > syntax). To be clear, we will still be using Spark/Flink underneath to
> > > perform the actual writing; just the SQL grammar is provided by Calcite.
> > >
> > > To give a taste of how this would look:
> > >
> > > A) If the user wants to mutate a Hudi table using SQL
> > >
> > > Instead of writing something like: spark.sql("UPDATE ....")
> > > users will write: hudiSparkSession.sql("UPDATE ....")
> > >
> > > B) To save a Spark data frame to a Hudi table,
> > > we continue to use Spark DataSource V1.
> > >
> > > The obvious challenge I see is the disconnect with the Spark DataFrame
> > > ecosystem. Users would write MERGE SQL statements by joining against
> > > other Spark DataFrames.
> > > If we want those expressed in Calcite as well, we need to also invest in
> > > the full Query side support, which can increase the scope by a lot.
> > > Some amount of investigation needs to happen, but ideally we should be
> > > able to integrate with the Spark SQL catalog and reuse all the tables
> > > there (see the example below).
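> > >
> > > For example, with catalog integration in place, a user might register a
> > > DataFrame as a temp view and join against it from a Calcite-parsed
> > > MERGE (hudiSparkSession is the hypothetical wrapper from above; the
> > > table and column names are made up):
> > >
> > >   // given: Dataset<Row> updates, the incoming changes as a DataFrame
> > >   updates.createOrReplaceTempView("updates");
> > >   hudiSparkSession.sql(
> > >       "MERGE INTO hudi_table t USING updates u ON t.id = u.id "
> > >     + "WHEN MATCHED THEN UPDATE SET t.val = u.val");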
> > >
> > > I am sure there are some gaps in my thinking. Just starting this
> > > thread, so we can discuss and others can chime in/correct me.
> > >
> > > thanks
> > > vinoth
> > >
> >
>
