[
https://issues.apache.org/jira/browse/HUDI-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396694#comment-17396694
]
ASF GitHub Bot commented on HUDI-2063:
--------------------------------------
nsivabalan commented on a change in pull request #3140:
URL: https://github.com/apache/hudi/pull/3140#discussion_r686014259
##########
File path: website/docs/querying_data.md
##########
@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Ser
both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).
+**NOTICE**
+
+Since 0.9.0, hudi syncs the table to hive as a spark datasource table, so `spark.sql.hive.convertMetastoreParquet=false` is no longer needed.
Review comment:
Not required in this patch, but just curious: did we test an upgrade scenario? i.e. an already existing table that was created with 0.8.0 and upgraded to 0.9.0 — does flipping the hive sync on by default work smoothly without any issues? If not, can you create a follow-up ticket? I can look into it.
##########
File path: website/docs/querying_data.md
##########
@@ -141,14 +146,13 @@ and executors. Alternatively, hudi-spark-bundle can also fetched via the `--pack
### Snapshot query {#spark-snap-query}
This method can be used to retrieve the data table at the present point in time.
-Note: The file path must be suffixed with a number of wildcard asterisks (`/*`) one greater than the number of partition levels. Eg: with table file path "tablePath" partitioned by columns "a", "b", and "c", the load path must be `tablePath + "/*/*/*/*"`
```scala
val hudiIncQueryDF = spark
     .read()
-    .format("org.apache.hudi")
+    .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
-    .load(tablePath + "/*") //The number of wildcard asterisks here must be one greater than the number of partition
+    .load(tablePath)
Review comment:
So this is not required anymore, is it?
##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
]}>
<TabItem value="scala">
```scala
// spark-shell for spark 3
spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
// spark-shell for spark 2 with scala 2.12
spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
// spark-shell for spark 2 with scala 2.11
spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
+</TabItem>
+<TabItem value="sparksql">
+
+Hudi supports using spark sql to write and read data with the **HoodieSparkSessionExtension** sql extension.
+```shell
+# spark sql for spark 3
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Review comment:
These versions can be left as 0.9.0, since this feature is available only in 0.9.0.
##########
File path: website/docs/quick-start-guide.md
##########
@@ -572,11 +859,83 @@ df.write.format("hudi").
// Should have different keys now for San Francisco alone, from query before.
spark.
  read.format("hudi").
-  load(basePath + "/*/*/*/*").
+  load(basePath).
  select("uuid","partitionpath").
  sort("partitionpath","uuid").
  show(100, false)
```
+**NOTICE**
+
+The insert overwrite statement on a partitioned table is converted to the ***insert_overwrite*** operation.
Review comment:
Again, can we add an example command here please?
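Illustrating the reviewer's request, a minimal sketch of the kind of command such a NOTICE could show (the table `h_p0` and its partition column `dt` are borrowed from the create-table examples elsewhere in this patch):

```sql
-- overwrites only the matched partition of h_p0; hudi runs this as
-- the insert_overwrite operation rather than insert_overwrite_table
insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
```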
##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
]}>
<TabItem value="scala">
```scala
// spark-shell for spark 3
spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
Review comment:
Guess we should not update these versions in this patch. As part of the release, the release manager will do a one-time update of all such versions. Can we revert these please?
##########
File path: website/docs/quick-start-guide.md
##########
@@ -119,6 +142,85 @@ The [DataGenerator](https://github.com/apache/hudi/blob/master/hudi-spark/src/ma
can generate sample inserts and updates based on the sample trip schema [here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
:::
+## Create Table
+
+Hudi supports creating tables using spark-sql.
+
+**Create Non-Partitioned Table**
+```sql
+-- create a managed cow table
+create table if not exists h0(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow',
+  primaryKey = 'id'
+);
+
+-- create an external mor table
+create table if not exists h1(
+  id int,
+  name string,
+  price double,
+  ts bigint
+) using hudi
+location '/tmp/hudi/h1'
+options (
+  type = 'mor',
+  primaryKey = 'id,name',
+  preCombineField = 'ts'
+);
+
+-- create a non-primary key table
+create table if not exists h2(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow'
+);
+```
+**Create Partitioned Table**
+```sql
+create table if not exists h_p0 (
+  id bigint,
+  name string,
+  dt string,
+  hh string
+) using hudi
+location '/tmp/hudi/h_p0'
+options (
+  type = 'cow',
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt, hh);
+```
+**Create Table on an Existing Hudi Table Path**
+
+We can create a table on an existing hudi table path. This is useful for reading/writing a hudi table created outside of sql with spark-sql.
+```sql
+create table h_p1 using hudi
+options (
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt)
+location '/path/to/hudi'
+```
+
+**Create Table Options**
+
+| Parameter Name | Introduction |
+|------------|--------|
+| primaryKey | The primary key names of the table, multiple fields separated by commas. |
+| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' if this option is not specified. |
+| preCombineField | The Pre-Combine field of the table. |
Review comment:
I would also suggest moving the "set options" section here, because it goes with create table. Maybe briefly talk about how to set hudi configs for a given table here, and dive into the details in the later section. As of now it is towards the very end; not sure users will get to the end to read those sections, and likely some may skip them.
##########
File path: website/docs/quick-start-guide.md
##########
@@ -119,6 +142,85 @@ The [DataGenerator](https://github.com/apache/hudi/blob/master/hudi-spark/src/ma
can generate sample inserts and updates based on the sample trip schema [here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
:::
+## Create Table
+
+Hudi supports creating tables using spark-sql.
+
+**Create Non-Partitioned Table**
+```sql
+-- create a managed cow table
+create table if not exists h0(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow',
+  primaryKey = 'id'
+);
+
+-- create an external mor table
+create table if not exists h1(
+  id int,
+  name string,
+  price double,
+  ts bigint
+) using hudi
+location '/tmp/hudi/h1'
+options (
+  type = 'mor',
+  primaryKey = 'id,name',
+  preCombineField = 'ts'
+);
+
+-- create a non-primary key table
+create table if not exists h2(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow'
+);
+```
+**Create Partitioned Table**
+```sql
+create table if not exists h_p0 (
+  id bigint,
+  name string,
+  dt string,
+  hh string
+) using hudi
+location '/tmp/hudi/h_p0'
+options (
+  type = 'cow',
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt, hh);
+```
+**Create Table on an Existing Hudi Table Path**
+
+We can create a table on an existing hudi table path. This is useful for reading/writing a hudi table created outside of sql with spark-sql.
+```sql
+create table h_p1 using hudi
+options (
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt)
+location '/path/to/hudi'
+```
+
+**Create Table Options**
+
+| Parameter Name | Introduction |
+|------------|--------|
+| primaryKey | The primary key names of the table, multiple fields separated by commas. |
+| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' if this option is not specified. |
+| preCombineField | The Pre-Combine field of the table. |
Review comment:
Can we add details on CTAS as well, either in this section or in the insert section (lines 250 ish)?
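As a sketch of what the requested CTAS addition could look like (table and column names here are assumptions, not from the patch; the bulk_insert note comes from a later review comment):

```sql
-- create a hudi table and populate it in one statement;
-- the bulk_insert operation is used with CTAS
create table h3 using hudi
options (type = 'cow', primaryKey = 'id')
as select 1 as id, 'a1' as name, 20 as price;
```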
##########
File path: website/docs/quick-start-guide.md
##########
@@ -119,6 +142,85 @@ The [DataGenerator](https://github.com/apache/hudi/blob/master/hudi-spark/src/ma
can generate sample inserts and updates based on the sample trip schema [here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
:::
+## Create Table
+
+Hudi supports creating tables using spark-sql.
+
+**Create Non-Partitioned Table**
Review comment:
Can we add some intro to managed and external tables? Just 1 to 2 lines, plus some external links so that interested folks can refer to them for more info.
Also, can we give some context on the different types of tables we can create with hudi: again, 1 line of brief detail about primary key vs non-primary key tables, and partitioned vs non-partitioned tables. And then we can dive into the create table commands.
##########
File path: website/docs/quick-start-guide.md
##########
@@ -146,6 +249,55 @@ df.write.format("hudi").
save(basePath)
```
+</TabItem>
+<TabItem value="sparksql">
+
+```sql
+insert into h0 select 1, 'a1', 20;
+
+-- insert static partition
+insert into h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
+-- insert dynamic partition
+insert into h_p0 select 1, 'a1', dt;
+
+-- insert dynamic partition
+insert into h_p1 select 1 as id, 'a1', '2021-01-03' as dt, '19' as hh;
+
+-- insert overwrite table
+insert overwrite table h0 select 1, 'a1', 20;
+
+-- insert overwrite table with static partition
+insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
+-- insert overwrite table with dynamic partition
+insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
+```
+
+**NOTICE**
+
+1. Insert mode
+
+Hudi supports three insert modes when inserting data into a table with a primary key (we call it a pk-table below):
+- upsert
+
+  This is the default insert mode. In upsert mode, the insert statement does an upsert operation on the pk-table, which updates duplicate records.
+- strict
+
+In strict mode, the insert statement keeps the primary key uniqueness constraint for COW tables, which do not allow duplicate records.
+If a record whose primary key already exists in the table is inserted, a HoodieDuplicateKeyException is thrown
+for COW tables. For MOR tables, it has the same behavior as "upsert" mode.
+
+- non-strict
+
+In non-strict mode, hudi just does the insert operation for the pk-table.
+
+We can set the insert mode by the config: **hoodie.sql.insert.mode**
Review comment:
typo. "insert"
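The insert-mode config named above can be set per session; a minimal sketch (the `set` syntax is standard Spark SQL, and the mode values come from the three modes the patch lists):

```sql
-- choose how insert into a pk-table behaves: upsert (default), strict, or non-strict
set hoodie.sql.insert.mode = non-strict;

-- subsequent inserts into a pk-table now use plain insert semantics
insert into h0 select 1, 'a1', 20;
```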
##########
File path: website/docs/quick-start-guide.md
##########
@@ -51,17 +74,17 @@ export PYSPARK_PYTHON=$(which python3)
# for spark3
pyspark
---packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1
+--packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1
Review comment:
Same comment as above. Let's not flip to 0.9.0 yet.
##########
File path: website/docs/quick-start-guide.md
##########
@@ -533,12 +816,16 @@ df.write.format("hudi").
// Should have different keys now, from query before.
spark.
  read.format("hudi").
-  load(basePath + "/*/*/*/*").
+  load(basePath).
  select("uuid","partitionpath").
  show(10, false)
```
+**NOTICE**
+
+The insert overwrite statement on a non-partitioned table is converted to the ***insert_overwrite_table*** operation.
Review comment:
Can we add an example command here?
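Illustrating the requested example, a minimal sketch (the non-partitioned table `h0` is borrowed from the insert examples elsewhere in this patch):

```sql
-- replaces the full contents of the non-partitioned table h0;
-- hudi runs this as the insert_overwrite_table operation
insert overwrite table h0 select 1, 'a1', 20;
```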
##########
File path: website/docs/quick-start-guide.md
##########
@@ -119,6 +142,85 @@ The [DataGenerator](https://github.com/apache/hudi/blob/master/hudi-spark/src/ma
can generate sample inserts and updates based on the sample trip schema [here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
:::
+## Create Table
+
+Hudi supports creating tables using spark-sql.
+
+**Create Non-Partitioned Table**
+```sql
+-- create a managed cow table
+create table if not exists h0(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow',
+  primaryKey = 'id'
+);
+
+-- create an external mor table
+create table if not exists h1(
+  id int,
+  name string,
+  price double,
+  ts bigint
+) using hudi
+location '/tmp/hudi/h1'
+options (
+  type = 'mor',
+  primaryKey = 'id,name',
+  preCombineField = 'ts'
+);
+
+-- create a non-primary key table
+create table if not exists h2(
+  id int,
+  name string,
+  price double
+) using hudi
+options (
+  type = 'cow'
+);
+```
+**Create Partitioned Table**
+```sql
+create table if not exists h_p0 (
+  id bigint,
+  name string,
+  dt string,
+  hh string
+) using hudi
+location '/tmp/hudi/h_p0'
+options (
+  type = 'cow',
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt, hh);
+```
+**Create Table on an Existing Hudi Table Path**
+
+We can create a table on an existing hudi table path. This is useful for reading/writing a hudi table created outside of sql with spark-sql.
+```sql
+create table h_p1 using hudi
+options (
+  primaryKey = 'id',
+  preCombineField = 'ts'
+)
+partitioned by (dt)
+location '/path/to/hudi'
+```
+
+**Create Table Options**
+
+| Parameter Name | Introduction |
+|------------|--------|
+| primaryKey | The primary key names of the table, multiple fields separated by commas. |
+| type | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Default value is 'cow' if this option is not specified. |
+| preCombineField | The Pre-Combine field of the table. |
Review comment:
When you add info on CTAS, add a note that bulk_insert will be used with CTAS.
##########
File path: website/docs/quick-start-guide.md
##########
@@ -263,6 +451,88 @@ df.write.format("hudi").
save(basePath)
```
+</TabItem>
+<TabItem value="sparksql">
+
+Spark sql supports two kinds of DML to update a hudi table: Merge-Into and Update.
+
+### MergeInto
+
+Hudi supports merge-into for both spark 2 & spark 3.
+
+**Syntax**
+
+```sql
+MERGE INTO tableIdentifier AS target_alias
+USING (sub_query | tableIdentifier) AS source_alias
+ON <merge_condition>
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
+
+<merge_condition> = an equality boolean condition
+<matched_action> =
+  DELETE |
+  UPDATE SET * |
+  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
+<not_matched_action> =
+  INSERT * |
+  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
+```
+**Case**
+```sql
+merge into h0 as target
+using (
+  select id, name, price, flag from s
+) source
+on target.id = source.id
+when matched then update set *
+when not matched then insert *
+;
+
+merge into h0
+using (
+  select id, name, price, flag from s
+) source
+on h0.id = source.id
+when matched and flag != 'delete' then update set id = source.id, name = source.name, price = source.price * 2
+when matched and flag = 'delete' then delete
+when not matched then insert (id,name,price) values(id, name, price)
+;
+```
+**Notice**
+
+1. The merge condition must be on the primary keys.
+2. Merge-On-Read tables do not support partial updates. e.g.
+```sql
+merge into h0 using s0
+on h0.id = s0.id
+when matched then update set price = s0.price * 2
+```
+This works well for COPY_ON_WRITE tables, which support updating only the **price** field, but it does not work
+for MERGE_ON_READ tables.
Review comment:
Instead of saying something works for COW and does not work for MOR, we can word it differently:
"This works well for Copy_On_Write, and support for Merge_On_Read will be added in a future release." Can you revisit the entire patch and fix all such phrases?
##########
File path: website/docs/quick-start-guide.md
##########
@@ -119,6 +142,85 @@ The [DataGenerator](https://github.com/apache/hudi/blob/master/hudi-spark/src/ma
can generate sample inserts and updates based on the sample trip schema [here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
:::
+## Create Table
+
+Hudi supports creating tables using spark-sql.
+
+**Create Non-Partitioned Table**
Review comment:
Also, add one line about cow and mor that these refer to
HoodieTableTypes, namely COPY_ON_WRITE and MERGE_ON_READ.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> [SQL] Add Doc For Spark Sql Integrates With Hudi
> ------------------------------------------------
>
> Key: HUDI-2063
> URL: https://issues.apache.org/jira/browse/HUDI-2063
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Docs
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Blocker
> Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)