yihua commented on code in PR #6547:
URL: https://github.com/apache/hudi/pull/6547#discussion_r958998148
##########
website/docs/quick-start-guide.md:
##########
@@ -958,6 +959,134 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
## Delete data {#deletes}
+Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and nulling out the values for all the other fields; (2) **Hard Deletes**: physically removing any trace of the record from the table.
+See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+
+### Soft Deletes
+
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', }
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+spark.
+ read.
+ format("hudi").
+ load(basePath).
+ createOrReplaceTempView("hudi_trips_snapshot")
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+// fetch two records for soft deletes
+val softDeleteDs = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+// prepare the soft deletes by ensuring the appropriate fields are nullified
+val nullifyColumns = softDeleteDs.schema.fields.
+ map(field => (field.name, field.dataType.typeName)).
+ filter(pair => (!HoodieRecord.HOODIE_META_COLUMNS.contains(pair._1)
+ && !Array("ts", "uuid", "partitionpath").contains(pair._1)))
+
+val softDeleteDf = nullifyColumns.
+ foldLeft(softDeleteDs.drop(HoodieRecord.HOODIE_META_COLUMNS: _*))(
+ (ds, col) => ds.withColumn(col._1, lit(null).cast(col._2)))
+
+// simply upsert the table after setting these fields to null
+softDeleteDf.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(OPERATION_OPT_KEY, "upsert").
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+
+// reload data
+spark.
+ read.
+ format("hudi").
+ load(basePath).
+ createOrReplaceTempView("hudi_trips_snapshot")
+
+// fetch should return total and (total - 2) records for the two queries respectively
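
The nullification step in the Scala snippet above can also be sketched independently of Spark. The following is a minimal, hypothetical Python illustration (not part of the PR diff) of the same idea: drop the Hudi meta columns, keep the record key, precombine field, and partition path, and set every other field to null before upserting. The field names mirror the quick-start example; `prepare_soft_delete` is an illustrative helper, not a Hudi API.

```python
# Hypothetical sketch of preparing a soft-delete payload, mirroring the
# Scala foldLeft above: meta columns are dropped, key fields are kept,
# and all remaining fields are nulled out.

# Hudi's standard meta columns (HoodieRecord.HOODIE_META_COLUMNS)
HOODIE_META_COLUMNS = {
    "_hoodie_commit_time", "_hoodie_commit_seqno",
    "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name",
}
# Fields that must survive so the upsert can locate the record
KEEP_FIELDS = {"ts", "uuid", "partitionpath"}

def prepare_soft_delete(record: dict) -> dict:
    """Drop meta columns; null every field outside KEEP_FIELDS."""
    return {
        name: (value if name in KEEP_FIELDS else None)
        for name, value in record.items()
        if name not in HOODIE_META_COLUMNS
    }

row = {
    "_hoodie_commit_time": "20220901000000",
    "uuid": "abc-123",
    "partitionpath": "americas/brazil_sao_paulo",
    "ts": 1661900000,
    "rider": "rider-A",
    "fare": 19.10,
}
print(prepare_soft_delete(row))
# uuid, partitionpath, and ts are retained; rider and fare become None
```

Upserting such a record leaves the row count unchanged, which is why the query filtering on `rider is not null` above returns two fewer records.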
Review Comment:
I added the notes as suggested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]