I agree with the Calcite approach.
On Thu, 17 Dec 2020 at 5:04 pm, Danny Chan <[email protected]> wrote:
> Apache Calcite is a good candidate for parsing and executing the SQL;
> Apache Flink has an extension of the SQL grammar based on the Calcite
> parser [1].
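>
> As a minimal sketch of the parsing entry point (Scala; the table and
> column names are made up, and this relies only on Calcite's default
> grammar, which already accepts MERGE and UPDATE):
>
>   import org.apache.calcite.sql.SqlKind
>   import org.apache.calcite.sql.parser.SqlParser
>
>   val sql =
>     "MERGE INTO hudi_tbl t USING staging s ON t.id = s.id " +
>       "WHEN MATCHED THEN UPDATE SET val = s.val " +
>       "WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)"
>   // parseStmt() returns a SqlNode tree (a SqlMerge here) that we can
>   // inspect before deciding how to execute it
>   val node = SqlParser.create(sql).parseStmt()
>   assert(node.getKind == SqlKind.MERGE)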
>
> > users will write: hudiSparkSession.sql("UPDATE ....")
>
> Should users still need to instantiate the hudiSparkSession first? My
> desired use case is for users to execute these SQLs through the Hoodie
> CLI; they can choose which engine to use via a CLI config option.
>
> > If we want those expressed in Calcite as well, we need to also invest in
> > full query-side support, which can increase the scope by a lot.
>
> That is true. My thought is that we use Calcite to execute only these
> MERGE SQL statements. For other DQL or DML, we would delegate the
> parsing/execution to the underlying engines (Flink or Spark); the Hoodie
> Calcite parser would only parse the query statements and hand them over
> to the engines. One thing to note is the SQL dialect difference: Spark
> may have its own syntax (keywords) that Calcite cannot parse/recognize.
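>
> Concretely, the dispatch could look something like this (a sketch only;
> runWithHudi and delegateToEngine are hypothetical stand-ins for the Hudi
> execution path and the raw pass-through to Flink/Spark):
>
>   import scala.util.{Success, Try}
>   import org.apache.calcite.sql.SqlKind
>   import org.apache.calcite.sql.parser.SqlParser
>
>   def runWithHudi(sql: String): Unit = ???      // hypothetical: Hudi executes the mutation
>   def delegateToEngine(sql: String): Unit = ??? // hypothetical: engine gets the raw string
>
>   def run(sql: String): Unit =
>     Try(SqlParser.create(sql).parseStmt().getKind) match {
>       case Success(SqlKind.MERGE | SqlKind.UPDATE | SqlKind.DELETE) =>
>         runWithHudi(sql)
>       case _ =>
>         // plain queries, or engine-specific syntax Calcite cannot parse,
>         // are handed to the underlying engine untouched
>         delegateToEngine(sql)
>     }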
>
> [1] https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
>
> Vinoth Chandar <[email protected]> wrote on Fri, Dec 11, 2020 at 3:58 PM:
>
> > Hello all,
> >
> > One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> > syntax to write into Hudi tables. We have looked into the Spark 3
> > DataSource V2 APIs as well and found several issues that hinder us from
> > implementing this via the Spark APIs:
> >
> > - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> > external data sources like Hudi; only DELETE is.
> > - The DataSource V2 API offers no flexibility to perform any kind of
> > further transformations on the dataframe. Hudi supports keys, indexes,
> > preCombining, and custom partitioning that ensures file sizes, etc. All
> > of this needs shuffling data, looking up/joining against other
> > dataframes, and so forth. Today, the DataSource V1 API allows this kind
> > of further partitioning/transformation, but the V2 API simply offers
> > partition-level iteration once the user calls df.write.format("hudi"),
> > as the sketch below shows.
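> >
> > To make that limitation concrete, the only hook V2 gives a source is
> > roughly the following shape (a sketch against Spark 3's connector API;
> > HudiV2Writer is a made-up name):
> >
> >   import org.apache.spark.sql.catalyst.InternalRow
> >   import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}
> >
> >   case object NoopCommit extends WriterCommitMessage
> >
> >   // Spark hands us one row at a time within each partition; by this
> >   // point there is no way to shuffle, repartition, or join against
> >   // another dataframe, which is exactly what Hudi's keying/preCombine
> >   // steps need.
> >   class HudiV2Writer extends DataWriter[InternalRow] {
> >     override def write(record: InternalRow): Unit = ()  // row-by-row only
> >     override def commit(): WriterCommitMessage = NoopCommit
> >     override def abort(): Unit = ()
> >     override def close(): Unit = ()
> >   }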
> >
> > One thought I had is to explore Apache Calcite and write an adapter for
> > Hudi. This frees us from being very dependent on a particular engine's
> > syntax support, like Spark's. Calcite is very popular by itself and
> > supports most of the keywords (and also more streaming-friendly syntax).
> > To be clear, we would still be using Spark/Flink underneath to perform
> > the actual writing; just the SQL grammar would be provided by Calcite.
> >
> > To give a taste of how this would look:
> >
> > A) If the user wants to mutate a Hudi table using SQL
> >
> > Instead of writing something like: spark.sql("UPDATE ....")
> > users will write: hudiSparkSession.sql("UPDATE ....")
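> >
> > (Here hudiSparkSession is a hypothetical wrapper; one possible shape,
> > just as a sketch:)
> >
> >   import org.apache.spark.sql.{DataFrame, SparkSession}
> >
> >   // Calcite owns the grammar; Spark still does the actual work.
> >   class HudiSparkSession(spark: SparkSession) {
> >     def sql(sqlText: String): DataFrame = {
> >       // sketch: a real implementation would first parse sqlText with
> >       // Calcite and route UPDATE/MERGE into Hudi's write path; only the
> >       // fall-through to plain Spark SQL is shown here
> >       spark.sql(sqlText)
> >     }
> >   }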
> >
> > B) To save a Spark DataFrame to a Hudi table,
> > we continue to use Spark DataSource V1.
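> >
> > For example (the option keys are Hudi's existing write configs; the
> > field names and path are illustrative):
> >
> >   import org.apache.spark.sql.{DataFrame, SaveMode}
> >
> >   def saveToHudi(df: DataFrame): Unit =
> >     df.write.format("hudi")
> >       .option("hoodie.table.name", "hudi_tbl")
> >       .option("hoodie.datasource.write.recordkey.field", "id")
> >       .option("hoodie.datasource.write.precombine.field", "ts")
> >       .option("hoodie.datasource.write.partitionpath.field", "dt")
> >       .mode(SaveMode.Append)
> >       .save("/tmp/hudi_tbl")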
> >
> > The obvious challenge I see is the disconnect with the Spark DataFrame
> > ecosystem. Users would write MERGE SQL statements by joining against
> > other Spark DataFrames.
> > If we want those expressed in Calcite as well, we need to also invest in
> > full query-side support, which can increase the scope by a lot.
> > Some amount of investigation needs to happen, but ideally we should be
> > able to integrate with the Spark SQL catalog and reuse all the tables
> > there.
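> >
> > e.g., at a minimum we could resolve the tables a parsed MERGE refers to
> > through the existing Spark catalog instead of maintaining a second one
> > (a sketch; resolveSource is a made-up helper):
> >
> >   import org.apache.spark.sql.{DataFrame, SparkSession}
> >
> >   // look up a table referenced in the Calcite-parsed statement via the
> >   // same catalog spark.sql uses, so users' existing tables just work
> >   def resolveSource(spark: SparkSession, table: String): Option[DataFrame] =
> >     if (spark.catalog.tableExists(table)) Some(spark.table(table)) else None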
> >
> > I am sure there are some gaps in my thinking. Just starting this thread,
> > so we can discuss and others can chime in/correct me.
> >
> > thanks
> > vinoth
> >
>