bhasudha commented on code in PR #9622:
URL: https://github.com/apache/hudi/pull/9622#discussion_r1324512919
##########
website/docs/quick-start-guide.md:
##########
@@ -827,633 +789,348 @@ denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `d
>
-## Incremental query
+## Delete data {#deletes}
-Hudi also provides capability to obtain a stream of records that changed since given commit timestamp.
-This can be achieved using Hudi's incremental querying and providing a begin time from which changes need to be streamed.
-We do not need to specify endTime, if we want all changes after the given commit (as is the common case).
+Hard deletes physically remove any trace of the record from the table. For example, this deletes records for the HoodieKeys passed in.
+Check out the [deletion section](/docs/writing_data#deletes) for more details.
+<br/><br/>
<Tabs
groupId="programming-language"
-defaultValue="python"
+defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
-{ label: 'Spark SQL', value: 'sparksql', }
+{ label: 'Spark SQL', value: 'sparksql', },
]}
>
<TabItem value="scala">
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
-// reload data
-spark.
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+// fetch two records to be deleted
+val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
+
+// issue deletes
+val deletes = dataGen.generateDeletes(ds.collectAsList())
+val hardDeleteDf = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
+
+hardDeleteDf.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(OPERATION_OPT_KEY, "delete").
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+
+// run the same read query as above.
+val roAfterDeleteViewDF = spark.
read.
format("hudi").
- load(basePath).
- createOrReplaceTempView("hudi_trips_snapshot")
+ load(basePath)
-val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
-val beginTime = commits(commits.length - 2) // commit time we are interested in
+roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
+// fetch should return (total - 2) records
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+```
+:::note
+Only `Append` mode is supported for the delete operation.
+:::
+</TabItem>
+<TabItem value="sparksql">
-// incrementally query data
-val tripsIncrementalDF = spark.read.format("hudi").
- option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
- option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
- load(basePath)
-tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
+**Syntax**
+```sql
+DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
+```
+**Example**
+```sql
+delete from hudi_cow_nonpcf_tbl where uuid = 1;
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
+delete from hudi_mor_tbl where id % 2 = 0;
Review Comment:
ditto
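The hard-delete flow in the quoted Scala snippet needs a running Spark shell with the Hudi quickstart setup. As a sanity check of the semantics it documents (select two keys, physically remove the matching rows, count drops by two), here is a minimal plain-Python sketch; the field names mirror the docs, but the in-memory list is a hypothetical stand-in for the `hudi_trips_snapshot` table:

```python
# Hypothetical stand-in for the hudi_trips_snapshot table -- no Spark required.
records = [
    {"uuid": f"id-{i}", "partitionpath": f"americas/part_{i % 3}", "ts": i}
    for i in range(10)
]

# Analogous to: spark.sql("select uuid, partitionpath ...").limit(2)
keys_to_delete = {(r["uuid"], r["partitionpath"]) for r in records[:2]}

# A hard delete physically removes every row matching the given keys.
after_delete = [
    r for r in records
    if (r["uuid"], r["partitionpath"]) not in keys_to_delete
]

print(len(records), len(after_delete))  # 10 8
```

This only models the row-level outcome the docs describe; in Hudi the delete is issued as a write with `OPERATION_OPT_KEY` set to `"delete"` in `Append` mode, as the snippet shows.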
##########
website/docs/quick-start-guide.md:
##########
@@ -827,633 +789,348 @@ denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `d
>
-## Incremental query
+## Delete data {#deletes}
-Hudi also provides capability to obtain a stream of records that changed since given commit timestamp.
-This can be achieved using Hudi's incremental querying and providing a begin time from which changes need to be streamed.
-We do not need to specify endTime, if we want all changes after the given commit (as is the common case).
+Hard deletes physically remove any trace of the record from the table. For example, this deletes records for the HoodieKeys passed in.
+Check out the [deletion section](/docs/writing_data#deletes) for more details.
+<br/><br/>
<Tabs
groupId="programming-language"
-defaultValue="python"
+defaultValue="scala"
values={[
{ label: 'Scala', value: 'scala', },
{ label: 'Python', value: 'python', },
-{ label: 'Spark SQL', value: 'sparksql', }
+{ label: 'Spark SQL', value: 'sparksql', },
]}
>
<TabItem value="scala">
+Delete records for the HoodieKeys passed in.<br/>
```scala
// spark-shell
-// reload data
-spark.
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+// fetch two records to be deleted
+val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
+
+// issue deletes
+val deletes = dataGen.generateDeletes(ds.collectAsList())
+val hardDeleteDf = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
+
+hardDeleteDf.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(OPERATION_OPT_KEY, "delete").
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+
+// run the same read query as above.
+val roAfterDeleteViewDF = spark.
read.
format("hudi").
- load(basePath).
- createOrReplaceTempView("hudi_trips_snapshot")
+ load(basePath)
-val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
-val beginTime = commits(commits.length - 2) // commit time we are interested in
+roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
+// fetch should return (total - 2) records
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+```
+:::note
+Only `Append` mode is supported for the delete operation.
+:::
+</TabItem>
+<TabItem value="sparksql">
-// incrementally query data
-val tripsIncrementalDF = spark.read.format("hudi").
- option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
- option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
- load(basePath)
-tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
+**Syntax**
+```sql
+DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
+```
+**Example**
+```sql
+delete from hudi_cow_nonpcf_tbl where uuid = 1;
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
+delete from hudi_mor_tbl where id % 2 = 0;
+
+-- delete using non-PK field
+delete from hudi_cow_pt_tbl where name = 'a1';
Review Comment:
ditto
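Unlike the key-based Scala path, the quoted Spark SQL examples delete by an arbitrary predicate (`id % 2 = 0`, or the non-PK field `name = 'a1'`). A minimal plain-Python sketch of that semantics, with a hypothetical in-memory table standing in for `hudi_mor_tbl` / `hudi_cow_pt_tbl`:

```python
# Hypothetical rows standing in for the tables in the SQL examples.
rows = [{"id": i, "name": f"a{i}"} for i in range(1, 6)]

def sql_delete(table, predicate):
    """DELETE FROM table WHERE predicate: keep only rows the predicate rejects."""
    return [r for r in table if not predicate(r)]

# delete from hudi_mor_tbl where id % 2 = 0
odd_only = sql_delete(rows, lambda r: r["id"] % 2 == 0)
print([r["id"] for r in odd_only])  # [1, 3, 5]

# delete from hudi_cow_pt_tbl where name = 'a1'  (deleting by a non-PK field)
no_a1 = sql_delete(rows, lambda r: r["name"] == "a1")
print([r["name"] for r in no_a1])  # ['a2', 'a3', 'a4', 'a5']
```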
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]