Vinoth and all,

Users on Gmail might be missing these emails: Gmail is down and mail sent to the gmail.com domain is bouncing back. As of 11pm UK time, the Google status update is here:
https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8b67908fadee664c68c240ff9f529ab

Best to bump this thread again tomorrow when Gmail is back up and running.
On Dec 15 2020, at 7:06 am, Vinoth Chandar <[email protected]> wrote:

> Hello all,
>
> Just bumping this thread again.
>
> thanks
> vinoth
>
> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hello all,
> >
> > One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
> > syntax to write into Hudi tables. We have looked into the Spark 3
> > DataSource V2 APIs as well and found several issues that hinder
> > implementing this via the Spark APIs:
> >
> > - As of this writing, the UPDATE/MERGE syntax is not opened up to
> > external datasources like Hudi; only DELETE is.
> > - The DataSource V2 API offers no flexibility to perform further
> > transformations on the dataframe. Hudi supports keys, indexes,
> > preCombining, and custom partitioning that ensures file sizes, etc.
> > All of this needs shuffling data, looking up/joining against other
> > dataframes, and so forth. Today, the DataSource V1 API allows this kind
> > of further repartitioning/transformation, but the V2 API simply offers
> > partition-level iteration once the user calls df.write.format("hudi").
> >
> > One thought I had is to explore Apache Calcite and write an adapter for
> > Hudi. This frees us from being very dependent on a particular engine's
> > syntax support, like Spark's. Calcite is popular in its own right,
> > supports most of the key SQL keywords, and also offers more
> > streaming-friendly syntax. To be clear, we would still use Spark/Flink
> > underneath to perform the actual writing; only the SQL grammar would be
> > provided by Calcite.
> >
> > To give a taste of how this would look:
> >
> > A) If the user wants to mutate a Hudi table using SQL:
> >
> > Instead of writing something like: spark.sql("UPDATE ....")
> > users will write: hudiSparkSession.sql("UPDATE ....")
> >
> > B) To save a Spark DataFrame to a Hudi table,
> > we continue to use Spark DataSource V1.
> >
> > The obvious challenge I see is the disconnect with the Spark DataFrame
> > ecosystem. Users would write MERGE SQL statements by joining against
> > other Spark DataFrames. If we want those expressed in Calcite as well,
> > we need to also invest in full query-side support, which could increase
> > the scope by a lot. Some amount of investigation needs to happen, but
> > ideally we should be able to integrate with the Spark SQL catalog and
> > reuse all the tables there.
> >
> > I am sure there are some gaps in my thinking. Just starting this thread
> > so we can discuss and others can chime in/correct me.
> >
> > thanks
> > vinoth
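For readers skimming the thread: the kind of MERGE statement being discussed is the standard SQL MERGE that Calcite's grammar supports. A minimal sketch, with table and column names invented purely for illustration (no such tables exist in Hudi's examples):

```sql
-- Hypothetical tables: hudi_trips is the Hudi target, staged_trips the source.
-- This only illustrates the grammar a Calcite-backed hudiSparkSession.sql()
-- call would accept; Hudi would still perform the write via Spark/Flink.
MERGE INTO hudi_trips AS t
USING staged_trips AS s
  ON t.uuid = s.uuid
WHEN MATCHED THEN
  UPDATE SET fare = s.fare, ts = s.ts
WHEN NOT MATCHED THEN
  INSERT (uuid, fare, ts) VALUES (s.uuid, s.fare, s.ts);
```

Under the proposal, the statement above would be parsed by Calcite, while a plain DataFrame save (path B) would keep going through the existing DataSource V1 `df.write.format("hudi")` route.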
