+1 for Calcite

Best,
Vino
David Sheard <[email protected]> 于2020年12月17日周四 下午2:15写道: > I agree with Calcite > > On Thu, 17 Dec 2020 at 5:04 pm, Danny Chan <[email protected]> wrote: > > > Apache Calcite is a good candidate for parsing and executing the SQL, > > Apache Flink has an extension for the SQL based on the Calcite parser > [1], > > > > > users will write : hudiSparkSession.sql("UPDATE ....") > > > > Should user still need to instatiate the hudiSparkSession first ? My > > desired use case is user use the Hoodie CLI to execute these SQLs. They > can > > choose what engine to use by a CLI config option. > > > > > If we want those expressed in Calcite as well, we need to also invest > in > > the full Query side support, which can increase the scope by a lot. > > > > That is true, my thought is that we use the Calcite to execute only these > > MERGE SQL statements. For DQL or DML, we would delegate the parse/execute > > to the undernethe engines(Flink or Spark), the Hoodie Calcite parser only > > parse the query statements and handover it to the engines. One thing > needs > > to note is the SQL dialect difference, the Spark may have its own > > syntax(keywords) that Calcite can not parse/recognize. > > > > [1] > > > > > https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen > > > > Vinoth Chandar <[email protected]> 于2020年12月11日周五 下午3:58写道: > > > > > Hello all, > > > > > > One feature that keeps coming up is the ability to use UPDATE, MERGE > sql > > > syntax to support writing into Hudi tables. We have looked into the > > Spark 3 > > > DataSource V2 APIs as well and found several issues that hinder us in > > > implementing this via the Spark APIs > > > > > > - As of this writing, the UPDATE/MERGE syntax is not really opened up > to > > > external datasources like Hudi. only DELETE is. > > > - DataSource V2 API offers no flexibility to perform any kind of > > > further transformations to the dataframe. Hudi supports keys, indexes, > > > preCombining and custom partitioning that ensures file sizes etc. All > > this > > > needs shuffling data, looking up/joining against other dataframes so > > forth. > > > Today, the DataSource V1 API allows this kind of further > > > partitions/transformations. But the V2 API is simply offers partition > > level > > > iteration once the user calls df.write.format("hudi") > > > > > > One thought I had is to explore Apache Calcite and write an adapter for > > > Hudi. This frees us from being very dependent on a particular engine's > > > syntax support like Spark. Calcite is very popular by itself and > supports > > > most of the key words and (also more streaming friendly syntax). To be > > > clear, we will still be using Spark/Flink underneath to perform the > > actual > > > writing, just that the SQL grammar is provided by Calcite. > > > > > > To give a taste of how this will look like. > > > > > > A) If the user wants to mutate a Hudi table using SQL > > > > > > Instead of writing something like : spark.sql("UPDATE ....") > > > users will write : hudiSparkSession.sql("UPDATE ....") > > > > > > B) To save a Spark data frame to a Hudi table > > > we continue to use Spark DataSource V1 > > > > > > The obvious challenge I see is the disconnect with the Spark DataFrame > > > ecosystem. Users would write MERGE SQL statements by joining against > > other > > > Spark DataFrames. > > > If we want those expressed in calcite as well, we need to also invest > in > > > the full Query side support, which can increase the scope by a lot. 
> > > Some amount of investigation needs to happen, but ideally we should be > > able > > > to integrate with the sparkSQL catalog and reuse all the tables there. > > > > > > I am sure there are some gaps in my thinking. Just starting this > thread, > > so > > > we can discuss and others can chime in/correct me. > > > > > > thanks > > > vinoth > > > > > >
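To make the parse-and-dispatch idea from the thread concrete, here is a minimal sketch, not actual Hudi code: the class HoodieSqlSession and the method runHudiDml are hypothetical names. It uses Calcite's SqlParser only to classify a statement, routes UPDATE/MERGE/DELETE to a Hudi-side handler, and delegates everything else, including any Spark-specific dialect Calcite fails to parse, to the underlying engine:

```scala
import org.apache.calcite.sql.{SqlKind, SqlNode}
import org.apache.calcite.sql.parser.{SqlParseException, SqlParser}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper behind hudiSparkSession.sql(...) from the proposal.
class HoodieSqlSession(spark: SparkSession) {

  def sql(statement: String): DataFrame = {
    try {
      // Parse with Calcite's default grammar just to classify the statement.
      val node: SqlNode = SqlParser.create(statement).parseStmt()
      node.getKind match {
        // Mutations are executed by Hudi itself, where key extraction,
        // index lookups, preCombining, and file sizing can all be applied.
        case SqlKind.UPDATE | SqlKind.MERGE | SqlKind.DELETE =>
          runHudiDml(node)
        // Plain queries are handed over to the engine untouched.
        case _ =>
          spark.sql(statement)
      }
    } catch {
      // Engine-specific keywords Calcite cannot recognize (the dialect
      // difference noted in the thread) also fall through to the engine.
      case _: SqlParseException => spark.sql(statement)
    }
  }

  // Translating the Calcite AST into a Hudi upsert/delete is the real
  // work; elided here.
  private def runHudiDml(node: SqlNode): DataFrame = ???
}
```

The try/catch fallback is one way to live with the dialect gap: anything Calcite rejects is retried on the engine rather than failed outright.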

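For point (B), the DataFrame write path stays on DataSource V1, so a save looks the same as today. A small usage sketch, assuming a local SparkSession and using the documented hoodie.datasource.write.* option keys; the table name and path are illustrative:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-write-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy frame whose columns match the option keys below.
val df = Seq(("id-1", 1L, "2020-12-17")).toDF("uuid", "ts", "partitionpath")

// The existing DataSource V1 write path, unchanged by the Calcite proposal.
df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.table.name", "my_table")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```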