bhasudha commented on code in PR #9622:
URL: https://github.com/apache/hudi/pull/9622#discussion_r1324512919
##########
website/docs/quick-start-guide.md:
##########
@@ -827,633 +789,348 @@ denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `d
>
-## Incremental query
+## Delete data {#deletes}
-Hudi also provides capability to obtain a stream of records that changed since given commit timestamp.
-This can be achieved using Hudi's incremental querying and providing a begin time from which changes need to be streamed.
-We do not need to specify endTime, if we want all changes after the given commit (as is the common case).
+Hard deletes physically remove any trace of the record from the table. For example, this deletes records for the HoodieKeys passed in.
+Check out the [deletion section](/docs/writing_data#deletes) for more details.
+<br/><br/>
<Tabs
groupId="programming-language"
-defaultValue="python"
+defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
-{ label: 'Spark SQL', value: 'sparksql', }
+{ label: 'Spark SQL', value: 'sparksql', },
]}
>
<TabItem value="scala">
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
-// reload data
-spark.
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+// fetch two records to be deleted
+val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
+
+// issue deletes
+val deletes = dataGen.generateDeletes(ds.collectAsList())
+val hardDeleteDf = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
+
+hardDeleteDf.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(OPERATION_OPT_KEY, "delete").
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+
+// run the same read query as above.
+val roAfterDeleteViewDF = spark.
read.
format("hudi").
- load(basePath).
- createOrReplaceTempView("hudi_trips_snapshot")
+ load(basePath)
-val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
-val beginTime = commits(commits.length - 2) // commit time we are interested in
+roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
+// fetch should return (total - 2) records
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+```
+:::note
+Only `Append` mode is supported for the delete operation.
+:::
+</TabItem>
+<TabItem value="sparksql">
-// incrementally query data
-val tripsIncrementalDF = spark.read.format("hudi").
- option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
- option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
- load(basePath)
-tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
+**Syntax**
+```sql
+DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
+```
+**Example**
+```sql
+delete from hudi_cow_nonpcf_tbl where uuid = 1;
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
+delete from hudi_mor_tbl where id % 2 = 0;
Review Comment:
ditto
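The hard-delete flow in the quoted Scala snippet needs a running Spark shell with the Hudi quickstart setup. As a sanity check of the semantics it documents (select two keys, physically remove the matching rows, count drops by two), here is a minimal plain-Python sketch; the field names mirror the docs, but the in-memory list is a hypothetical stand-in for the `hudi_trips_snapshot` table:

```python
# Hypothetical stand-in for the hudi_trips_snapshot table -- no Spark required.
records = [
    {"uuid": f"id-{i}", "partitionpath": f"americas/part_{i % 3}", "ts": i}
    for i in range(10)
]

# Analogous to: spark.sql("select uuid, partitionpath ...").limit(2)
keys_to_delete = {(r["uuid"], r["partitionpath"]) for r in records[:2]}

# A hard delete physically removes every row matching the given keys.
after_delete = [
    r for r in records
    if (r["uuid"], r["partitionpath"]) not in keys_to_delete
]

print(len(records), len(after_delete))  # 10 8
```

This only models the row-level outcome the docs describe; in Hudi the delete is issued as a write with `OPERATION_OPT_KEY` set to `"delete"` in `Append` mode, as the snippet shows.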
##########
website/docs/quick-start-guide.md:
##########
@@ -827,633 +789,348 @@ denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `d
>
-## Incremental query
+## Delete data {#deletes}
-Hudi also provides capability to obtain a stream of records that changed since given commit timestamp.
-This can be achieved using Hudi's incremental querying and providing a begin time from which changes need to be streamed.
-We do not need to specify endTime, if we want all changes after the given commit (as is the common case).
+Hard deletes physically remove any trace of the record from the table. For example, this deletes records for the HoodieKeys passed in.
+Check out the [deletion section](/docs/writing_data#deletes) for more details.
+<br/><br/>
<Tabs
groupId="programming-language"
-defaultValue="python"
+defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
-{ label: 'Spark SQL', value: 'sparksql', }
+{ label: 'Spark SQL', value: 'sparksql', },
]}
>
<TabItem value="scala">
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
-// reload data
-spark.
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+// fetch two records to be deleted
+val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
+
+// issue deletes
+val deletes = dataGen.generateDeletes(ds.collectAsList())
+val hardDeleteDf = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
+
+hardDeleteDf.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(OPERATION_OPT_KEY, "delete").
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+
+// run the same read query as above.
+val roAfterDeleteViewDF = spark.
read.
format("hudi").
- load(basePath).
- createOrReplaceTempView("hudi_trips_snapshot")
+ load(basePath)
-val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
-val beginTime = commits(commits.length - 2) // commit time we are interested in
+roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
+// fetch should return (total - 2) records
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+```
+:::note
+Only `Append` mode is supported for the delete operation.
+:::
+</TabItem>
+<TabItem value="sparksql">
-// incrementally query data
-val tripsIncrementalDF = spark.read.format("hudi").
- option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
- option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
- load(basePath)
-tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
+**Syntax**
+```sql
+DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
+```
+**Example**
+```sql
+delete from hudi_cow_nonpcf_tbl where uuid = 1;
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
+delete from hudi_mor_tbl where id % 2 = 0;
+
+-- delete using non-PK field
+delete from hudi_cow_pt_tbl where name = 'a1';
Review Comment:
ditto
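Unlike the key-based Scala path, the quoted Spark SQL examples delete by an arbitrary predicate (`id % 2 = 0`, or the non-PK field `name = 'a1'`). A minimal plain-Python sketch of that semantics, with a hypothetical in-memory table standing in for `hudi_mor_tbl` / `hudi_cow_pt_tbl`:

```python
# Hypothetical rows standing in for the tables in the SQL examples.
rows = [{"id": i, "name": f"a{i}"} for i in range(1, 6)]

def sql_delete(table, predicate):
    """DELETE FROM table WHERE predicate: keep only rows the predicate rejects."""
    return [r for r in table if not predicate(r)]

# delete from hudi_mor_tbl where id % 2 = 0
odd_only = sql_delete(rows, lambda r: r["id"] % 2 == 0)
print([r["id"] for r in odd_only])  # [1, 3, 5]

# delete from hudi_cow_pt_tbl where name = 'a1'  (deleting by a non-PK field)
no_a1 = sql_delete(rows, lambda r: r["name"] == "a1")
print([r["name"] for r in no_a1])  # ['a2', 'a3', 'a4', 'a5']
```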
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]