Apache Calcite is a good candidate for parsing and executing the SQL;
Apache Flink already has an extension of the Calcite parser for its SQL
dialect [1].
> users will write : hudiSparkSession.sql("UPDATE ....")
Would users still need to instantiate the hudiSparkSession first? My
desired use case is for users to execute these SQL statements through the
Hoodie CLI; they could choose which engine to use via a CLI config option.
> If we want those expressed in Calcite as well, we need to also invest in
the full Query side support, which can increase the scope by a lot.
That is true. My thought is that we use Calcite to execute only these
MERGE SQL statements; for DQL (and the rest of DML), we would delegate the
parsing/execution to the underlying engines (Flink or Spark). The Hoodie
Calcite parser would only recognize the query statements and hand them over
to the engines. One thing to note is the SQL dialect difference: Spark may
have its own syntax (keywords) that Calcite cannot parse/recognize.
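
To make the dispatch concrete, here is a rough Scala sketch of how the CLI
entry point could look (HudiSqlRunner and the helper methods are made-up
names; only Calcite's SqlParser/SqlKind API is real):

import scala.util.{Success, Try}
import org.apache.calcite.sql.{SqlKind, SqlNode}
import org.apache.calcite.sql.parser.SqlParser

// Hypothetical CLI entry point: Calcite parses the statement; MERGE/UPDATE/
// DELETE become Hudi writes, everything else (including engine-specific
// syntax Calcite cannot parse) is handed to the underlying engine as-is.
object HudiSqlRunner {

  def run(sql: String): Unit =
    Try(SqlParser.create(sql).parseStmt()) match {
      case Success(stmt) if isMutation(stmt.getKind) =>
        // Translate the Calcite AST into a Hudi upsert/delete, executed on
        // the engine chosen via a CLI config option (Spark or Flink).
        executeAsHudiWrite(stmt)
      case _ =>
        // Plain queries, or keywords Calcite does not recognize, go straight
        // to the engine's own SQL layer.
        delegateToEngine(sql)
    }

  private def isMutation(kind: SqlKind): Boolean =
    kind == SqlKind.MERGE || kind == SqlKind.UPDATE || kind == SqlKind.DELETE

  private def executeAsHudiWrite(stmt: SqlNode): Unit = ??? // build the Hudi write
  private def delegateToEngine(sql: String): Unit = ???     // e.g. sparkSession.sql(sql)
}

Parsing with Calcite first and falling back to the engine would keep the
dialect problem contained to the statements we actually need to rewrite.
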
[1]
https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
Vinoth Chandar <[email protected]> wrote on Fri, Dec 11, 2020 at 3:58 PM:
> Hello all,
>
> One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> syntax to write into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well and found several issues that hinder us from
> implementing this via the Spark APIs:
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi; only DELETE is.
> - The DataSource V2 API offers no flexibility to perform any kind of
> further transformation on the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures file sizes, etc. All of
> this needs shuffling data, looking up/joining against other dataframes and
> so forth. Today, the DataSource V1 API allows this kind of further
> partitioning/transformation, but the V2 API simply offers partition-level
> iteration once the user calls df.write.format("hudi").
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being very dependent on a particular engine's
> syntax support, like Spark's. Calcite is very popular by itself and supports
> most of the keywords (and also more streaming-friendly syntax). To be
> clear, we will still be using Spark/Flink underneath to perform the actual
> writing; just the SQL grammar is provided by Calcite.
>
> To give a taste of how this will look:
>
> A) If the user wants to mutate a Hudi table using SQL
>
> Instead of writing something like : spark.sql("UPDATE ....")
> users will write : hudiSparkSession.sql("UPDATE ....")
>
> B) To save a Spark data frame to a Hudi table
> we continue to use Spark DataSource V1
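>
> For reference, roughly today's quickstart-style write stays as-is (the
> option keys below are just illustrative):
>
>   import org.apache.spark.sql.SaveMode
>   // df, tableName and basePath are assumed to already exist
>   df.write.format("hudi").
>     option("hoodie.table.name", tableName).
>     option("hoodie.datasource.write.recordkey.field", "uuid").
>     option("hoodie.datasource.write.precombine.field", "ts").
>     option("hoodie.datasource.write.partitionpath.field", "partition").
>     mode(SaveMode.Append).
>     save(basePath)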
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would write MERGE SQL statements by joining against other
> Spark DataFrames.
> If we want those expressed in Calcite as well, we need to also invest in
> the full Query side support, which can increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be able
> to integrate with the Spark SQL catalog and reuse all the tables there.
>
> I am sure there are some gaps in my thinking. Just starting this thread, so
> we can discuss and others can chime in/correct me.
>
> thanks
> vinoth
>