Hi, pzwpzw

I see what you mean; it is very necessary to implement a common SQL layer
for Hudi, and we are also planning to implement Spark SQL write capabilities
for SQL-based ETL processing. The common SQL layer and Spark SQL write
support can combine to form Hudi's SQL capabilities.

At 2020-12-21 19:30:36, "pzwpzw" <pengzhiwei2...@icloud.com.INVALID> wrote:

Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to
process engine-independent SQL, for example most of the DDL and the Hoodie
CLI commands, and also provide a parser for the common SQL extensions (e.g.
MERGE INTO). Engine-specific syntax can be handed off to the respective
engine to process. If the common SQL layer can handle the input SQL, it
handles it; otherwise the statement is routed to the engine for processing.
In the long term, the common layer will become richer and more complete.
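To make the routing concrete, here is a minimal sketch of that
try-the-common-layer-first dispatch. The names are hypothetical: commonLayer
and engineLayer stand in for the Calcite-based parser and the engine's own
parser, and Plan for whatever plan type they produce.

    import scala.util.{Failure, Success, Try}

    // If the common layer can parse the SQL, use its plan;
    // otherwise fall back to the engine's own parser.
    def route[Plan](sql: String,
                    commonLayer: String => Plan,
                    engineLayer: String => Plan): Plan =
      Try(commonLayer(sql)) match {
        case Success(plan) => plan             // common layer handled it
        case Failure(_)    => engineLayer(sql) // route to the engine
      }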
On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:


Hi, all


That's very good. Hudi SQL syntax could support Flink, Hive, and other
analysis components at the same time. But there are some questions about
Spark SQL: Spark SQL syntax conflicts with Calcite syntax. Is our strategy
user migration or syntax compatibility?
In addition, will it also support write SQL?

On 2020-12-19 02:10:16, "Nishith" <n3.nas...@gmail.com> wrote:
That’s awesome. Looks like we have a consensus on Calcite. Look forward to the 
RFC as well!
-Nishith
On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vin...@apache.org> wrote:
Sounds good. Look forward to a RFC/DISCUSS thread.
Thanks,
Vinoth
On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <danny0...@apache.org> wrote:
Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite. I would
add support for SQL connectors for Hoodie Flink soon ~ Currently, I'm
preparing a refactoring of the current Flink writer code.
On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vin...@apache.org> wrote:
Thanks Kabeer for the note on gmail. Did not realize that. :)
My desired use case is for users to use the Hoodie CLI to execute these
SQLs. They can choose which engine to use via a CLI config option.
Yes, that is also another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine (sounds
like Hive, huh). Longer term, if we really think we can more easily
implement full DML + DDL + DQL, we can proceed with this.
As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route before we take this on more fully.
The other part where Calcite is great is all the support for
windowing/streaming in its syntax. Danny, I guess we should be able to
leverage that through a deeper Flink/Hudi integration?
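For a taste of that streaming syntax, a small sketch parsing a STREAM query
with Calcite's stock parser (the orders table and its columns are made up):

    import org.apache.calcite.sql.parser.SqlParser

    // Calcite's default parser already understands the STREAM keyword.
    val node = SqlParser
      .create("SELECT STREAM ts, symbol, price FROM orders")
      .parseQuery()
    println(node.getKind) // SELECT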

On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vin...@apache.org> wrote:
I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.


On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2...@icloud.com.invalid> wrote:
Yeah, I agree with Nishith that one option is to look at ways to plug in
custom logical and physical plans in Spark. It can simplify the
implementation and reuse the Spark SQL syntax. Also, users familiar with
Spark SQL will be able to use Hudi's SQL features more quickly. In fact,
Spark provides the SparkSessionExtensions interface for implementing custom
syntax extensions and SQL rewrite rules.


https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
We can use SparkSessionExtensions to extend Hoodie SQL syntax, such as
MERGE INTO and DDL syntax.
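To illustrate, a minimal sketch against the Spark 2.4.5 API linked above.
HoodieSqlParser is a hypothetical name; this version just delegates
everything to Spark's default parser, and parsePlan is where a real
MERGE INTO / DDL implementation would hook in:

    import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
    import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.parser.ParserInterface
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.types.{DataType, StructType}

    // Wraps Spark's parser; a real implementation would intercept
    // MERGE INTO / Hudi DDL in parsePlan before delegating.
    class HoodieSqlParser(delegate: ParserInterface) extends ParserInterface {
      override def parsePlan(sqlText: String): LogicalPlan =
        delegate.parsePlan(sqlText) // custom MERGE INTO handling goes here
      override def parseExpression(sqlText: String): Expression =
        delegate.parseExpression(sqlText)
      override def parseTableIdentifier(sqlText: String): TableIdentifier =
        delegate.parseTableIdentifier(sqlText)
      override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier =
        delegate.parseFunctionIdentifier(sqlText)
      override def parseTableSchema(sqlText: String): StructType =
        delegate.parseTableSchema(sqlText)
      override def parseDataType(sqlText: String): DataType =
        delegate.parseDataType(sqlText)
    }

    // Register the custom parser when building the session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .withExtensions { ext: SparkSessionExtensions =>
        ext.injectParser((_, delegate) => new HoodieSqlParser(delegate))
      }
      .getOrCreate()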
On Dec 15, 2020, at 3:27 PM, Nishith <n3.nas...@gmail.com> wrote:
Thanks for starting this thread, Vinoth. In general, I definitely see the
need for SQL-style semantics on Hudi tables. Apache Calcite is a great
option to consider given that DataSource V2 has the limitations that you
described.
Additionally, even if Spark DataSource V2 allowed for the flexibility, the
same SQL semantics would need to be supported in other engines like Flink to
provide the same experience to users - which in itself could also be a
considerable amount of work. So, if we're able to generalize the SQL story
along Calcite, that would help reduce redundant work in some sense.
Although, I'm worried about a few things:
1) Like you pointed out, writing complex user jobs using Spark SQL syntax
can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
for cross-table joins, merges, etc. using data frames. One option is to look
at whether there are ways to plug in custom logical and physical plans in
Spark. This way, although the merge-on-Spark-SQL functionality may not be as
simple to use, it wouldn't take away performance and feature set for
starters; in the future we could think of having the entire query space be
powered by Calcite like you mentioned.
2) If we continue to use DataSource V1, is there any downside to this from a
performance and optimization perspective when executing the plan? I'm
guessing not, but I haven't delved into the code to see if there's anything
different apart from the API and spec.
On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vin...@apache.org> wrote:

Hello all,

Just bumping this thread again

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vin...@apache.org> wrote:

Hello all,

One feature that keeps coming up is the ability to use UPDATE, MERGE SQL
syntax to support writing into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implementing this via the Spark APIs:

- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external data sources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of
further transformations on the dataframe. Hudi supports keys, indexes,
preCombining, and custom partitioning that ensures file sizes, etc. All of
this needs shuffling data, looking up/joining against other dataframes, and
so forth. Today, the DataSource V1 API allows this kind of further
partitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi").

One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular by itself and supports
most of the keywords (and also more streaming-friendly syntax). To be
clear, we will still be using Spark/Flink underneath to perform the actual
writing; just the SQL grammar is provided by Calcite.
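As a rough illustration of the Calcite side (the table and column names are
made up), the stock calcite-core parser already accepts DML such as UPDATE:

    import org.apache.calcite.sql.parser.SqlParser

    // Parse an UPDATE statement; a Hudi adapter would walk the resulting
    // SqlNode tree and translate it into a Spark/Flink write underneath.
    val update = SqlParser
      .create("UPDATE hudi_tbl SET price = 10 WHERE uuid = '1'")
      .parseStmt()
    println(update.getKind) // UPDATE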

To give a taste of how this would look:

A) If the user wants to mutate a Hudi table using SQL

Instead of writing something like: spark.sql("UPDATE ....")
users will write: hudiSparkSession.sql("UPDATE ....")

B) To save a Spark data frame to a Hudi table
we continue to use Spark DataSource V1
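For instance, a minimal V1 write sketch, assuming the hudi-spark bundle is
on the classpath (the option keys are the standard Hudi write configs; the
table name, fields, and path are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val df = Seq(("id-1", "a", 1000L)).toDF("uuid", "name", "ts")

    // DataSource V1 write into a Hudi table
    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode("append")
      .save("/tmp/hudi/my_table")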

The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in Calcite as well, we need to also invest in
the full query-side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be
able to integrate with the Spark SQL catalog and reuse all the tables there.

I am sure there are some gaps in my thinking. Just starting this thread,
so we can discuss and others can chime in/correct me.

thanks
vinoth