[ https://issues.apache.org/jira/browse/HUDI-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397295#comment-17397295 ]

ASF GitHub Bot commented on HUDI-2063:
--------------------------------------

pengzhiwei2018 commented on a change in pull request #3140:
URL: https://github.com/apache/hudi/pull/3140#discussion_r686733674



##########
File path: website/docs/querying_data.md
##########
@@ -119,6 +119,11 @@ By default, Spark SQL will try to use its own parquet reader instead of Hive Ser
 both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`.
 This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark).
 
+**NOTICE**
+
+Since 0.9.0, Hudi will sync the table to Hive as a Spark datasource table, so the `spark.sql.hive.convertMetastoreParquet=false` config is no longer needed.

Review comment:
       Well, for an existing old table, the hive sync will not update the spark datasource table properties on the table. So users still need this config when querying.

##########
File path: website/docs/quick-start-guide.md
##########
@@ -263,6 +451,88 @@ df.write.format("hudi").
   save(basePath)
 ```
 
+</TabItem>
+<TabItem value="sparksql">
+
+Spark SQL supports two kinds of DML to update a Hudi table: Merge-Into and Update.
+
+### MergeInto
+
+Hudi supports merge-into for both Spark 2 and Spark 3.
+
+**Syntax**
+
+```sql
+MERGE INTO tableIdentifier AS target_alias
+USING (sub_query | tableIdentifier) AS source_alias
+ON <merge_condition>
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
+[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
+
+<merge_condition> = A boolean equality condition
+<matched_action>  =
+  DELETE  |
+  UPDATE SET *  |
+  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
+<not_matched_action>  =
+  INSERT *  |
+  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
+```
+**Case**
+```sql
+merge into h0 as target
+using (
+  select id, name, price, flag from s
+) source
+on target.id = source.id
+when matched then update set *
+when not matched then insert *
+;
+
+merge into h0
+using (
+  select id, name, price, flag from s
+) source
+on h0.id = source.id
+when matched and flag != 'delete' then update set id = source.id, name = source.name, price = source.price * 2
+when matched and flag = 'delete' then delete
+when not matched then insert (id,name,price) values(id, name, price)
+;
+```
+**Notice**
+
+1. The merge-on condition must be on the primary keys.
+2. Merge-On-Read tables do not support partial updates, e.g.
+```sql
+ merge into h0 using s0
+ on h0.id = s0.id
+ when matched then update set price = s0.price * 2
+```
+This works well for Copy-On-Write tables, which support updating only the **price** field, but it does not work for Merge-On-Read tables.
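For a Merge-On-Read table, a full-row update can be used instead; a hedged sketch reusing the `h0`/`s0` tables above (scale the price in the source query rather than in a partial update):

```sql
-- MOR tables require updating every column, so use update set *
merge into h0
using (select id, name, price * 2 as price from s0) s0
on h0.id = s0.id
when matched then update set *
```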

Review comment:
       Make sense.

##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
 values={[
 { label: 'Scala', value: 'scala', },
 { label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
 ]}>
 <TabItem value="scala">
 
 ```scala
 // spark-shell for spark 3
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \

Review comment:
       OK

##########
File path: website/docs/querying_data.md
##########
@@ -141,14 +146,13 @@ and executors. Alternatively, hudi-spark-bundle can also fetched via the `--pack
 
 ### Snapshot query {#spark-snap-query}
 This method can be used to retrieve the data table at the present point in time.
-Note: The file path must be suffixed with a number of wildcard asterisk (`/*`) one greater than the number of partition levels. Eg: with table file path "tablePath" partitioned by columns "a", "b", and "c", the load path must be `tablePath + "/*/*/*/*"`
 
 ```scala
 val hudiIncQueryDF = spark
      .read()
-     .format("org.apache.hudi")
+     .format("hudi")
     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
-     .load(tablePath + "/*") //The number of wildcard asterisks here must be one greater than the number of partition
+     .load(tablePath)

Review comment:
       Yes, since the 0.9.0 version, we support HoodieFileIndex, which does not need the "*".

##########
File path: website/docs/quick-start-guide.md
##########
@@ -22,26 +22,49 @@ defaultValue="scala"
 values={[
 { label: 'Scala', value: 'scala', },
 { label: 'Python', value: 'python', },
+{ label: 'SparkSQL', value: 'sparksql', },
 ]}>
 <TabItem value="scala">
 
 ```scala
 // spark-shell for spark 3
 spark-shell \
-  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
+  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
 // spark-shell for spark 2 with scala 2.12
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:2.4.4 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
 // spark-shell for spark 2 with scala 2.11
 spark-shell \
-  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
+  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 ```
 
+</TabItem>
+<TabItem value="sparksql">
+
+Hudi supports using Spark SQL to write and read data with the **HoodieSparkSessionExtension** SQL extension.
+```shell
+# spark sql for spark 3
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \

Review comment:
       Ok

##########
File path: website/docs/quick-start-guide.md
##########
@@ -146,6 +249,55 @@ df.write.format("hudi").
   save(basePath)
 ``` 
 
+</TabItem>
+<TabItem value="sparksql">
+    
+```sql
+insert into h0 select 1, 'a1', 20;
+
+-- insert static partition
+insert into h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
+-- insert dynamic partition
+insert into h_p0 select 1, 'a1', dt;
+
+-- insert dynamic partition
+insert into h_p1 select 1 as id, 'a1', '2021-01-03' as dt, '19' as hh;
+
+-- insert overwrite table
+insert overwrite table h0 select 1, 'a1', 20;
+
+-- insert overwrite table with static partition
+insert overwrite h_p0 partition(dt = '2021-01-02') select 1, 'a1';
+
-- insert overwrite table with dynamic partition
+insert overwrite table h_p1 select 2 as id, 'a2', '2021-01-03' as dt, '19' as hh;
+```
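For context, the tables used in the inserts above could be created along these lines; a hedged sketch only, since the column lists and table names are assumptions not shown in this hunk:

```sql
-- non-partitioned table used by the plain inserts (schema assumed)
create table h0 (id int, name string, price double) using hudi;

-- table partitioned by dt (schema assumed)
create table h_p0 (id int, name string, dt string)
using hudi
partitioned by (dt);
```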
+
+**NOTICE**
+
+1. Insert mode
+
+Hudi supports three insert modes when inserting data into a table with a primary key (we call it a pk-table below):
+- upsert
+
+  This is the default insert mode. In upsert mode, the insert statement performs an upsert on the pk-table, which updates any duplicate record.
+- strict
+
+  In strict mode, the insert statement enforces the primary key uniqueness constraint for COW tables, which do not allow duplicate records.
+  If a record whose primary key already exists in the table is inserted, a HoodieDuplicateKeyException is thrown for the COW table. For MOR tables, the behavior is the same as in "upsert" mode.
+
+- non-strict
+
+  In non-strict mode, hudi simply performs the insert operation on the pk-table.
+
+We can set the insert mode via the config **hoodie.sql.insert.mode**.
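The modes above are switched at runtime via that config; a minimal sketch, assuming the `h0` pk-table from the earlier examples and the mode names as written in the prose:

```sql
-- strict mode: inserting a duplicate key into a COW table throws HoodieDuplicateKeyException
set hoodie.sql.insert.mode = strict;
insert into h0 select 1, 'a1', 20;

-- non-strict mode: plain insert, duplicate records are allowed
set hoodie.sql.insert.mode = non-strict;
insert into h0 select 1, 'a1', 20;
```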

Review comment:
       done!

##########
File path: website/docs/quick-start-guide.md
##########
@@ -533,12 +816,16 @@ df.write.format("hudi").
 // Should have different keys now, from query before.
 spark.
   read.format("hudi").
-  load(basePath + "/*/*/*/*").
+  load(basePath).
   select("uuid","partitionpath").
   show(10, false)
 
 ``` 
 
+**NOTICE**
+
+The insert overwrite statement on a non-partitioned table is converted to the ***insert_overwrite_table*** operation.

Review comment:
       make sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> [SQL] Add Doc For Spark Sql Integrates With Hudi
> ------------------------------------------------
>
>                 Key: HUDI-2063
>                 URL: https://issues.apache.org/jira/browse/HUDI-2063
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Docs
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Blocker
>              Labels: pull-request-available, release-blocker
>             Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
