nsivabalan commented on code in PR #6547:
URL: https://github.com/apache/hudi/pull/6547#discussion_r958977056


##########
website/docs/quick-start-guide.md:
##########
@@ -958,6 +959,134 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
 
 ## Delete data {#deletes}
 
+Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and just nulling out the values
+for all the other fields; (2) **Hard Deletes**: physically removing any trace of the record from the table.
+See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+
+### Soft Deletes
+
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', }
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+// fetch two records for soft deletes
+val softDeleteDs = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+// prepare the soft deletes by ensuring the appropriate fields are nullified
+val nullifyColumns = softDeleteDs.schema.fields.
+  map(field => (field.name, field.dataType.typeName)).
+  filter(pair => (!HoodieRecord.HOODIE_META_COLUMNS.contains(pair._1)
+    && !Array("ts", "uuid", "partitionpath").contains(pair._1)))
+
+val softDeleteDf = nullifyColumns.
+  foldLeft(softDeleteDs.drop(HoodieRecord.HOODIE_META_COLUMNS: _*))(
+    (ds, col) => ds.withColumn(col._1, lit(null).cast(col._2)))
+
+// simply upsert the table after setting these fields to null
+softDeleteDf.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(OPERATION_OPT_KEY, "upsert").
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+
+// reload data
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+
+// fetch should return total and (total - 2) records for the two queries respectively

Review Comment:
   minor. this query will return `total` records and not `total - 2`



##########
website/docs/writing_data.md:
##########
@@ -297,6 +297,33 @@ For more info refer to [Delete support in Hudi](https://cwiki.apache.org/conflue
 - **Soft Deletes** : Retain the record key and just null out the values for all the other fields.
   This can be achieved by ensuring the appropriate fields are nullable in the table schema and simply upserting the table after setting these fields to null.
 
+For example:
+```scala
+// fetch two records for soft deletes
+val softDeleteDs = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+// prepare the soft deletes by ensuring the appropriate fields are nullified
+val nullifyColumns = softDeleteDs.schema.fields.
+  map(field => (field.name, field.dataType.typeName)).
+  filter(pair => (!HoodieRecord.HOODIE_META_COLUMNS.contains(pair._1)
+    && !Array("ts", "uuid", "partitionpath").contains(pair._1)))
+
+val softDeleteDf = nullifyColumns.
+  foldLeft(softDeleteDs.drop(HoodieRecord.HOODIE_META_COLUMNS: _*))(
+    (ds, col) => ds.withColumn(col._1, lit(null).cast(col._2)))
+
+// simply upsert the table after setting these fields to null
+softDeleteDf.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(OPERATION_OPT_KEY, "upsert").
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+```
+

Review Comment:
   you can also add the same note here as well. 
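The drop-then-nullify fold in the snippet above can be sketched without Spark. The following plain-Python sketch uses a hypothetical in-memory record (a dict), not the Spark/Hudi API, to show which columns a soft delete keeps and which it nulls out:

```python
# Hypothetical in-memory stand-ins for the Hudi meta columns and the
# key/precombine/partition fields that a soft delete must preserve.
HOODIE_META_COLUMNS = ["_hoodie_commit_time", "_hoodie_commit_seqno",
                       "_hoodie_record_key", "_hoodie_partition_path",
                       "_hoodie_file_name"]
KEEP_COLUMNS = ["ts", "uuid", "partitionpath"]

def soft_delete(record):
    """Drop the meta columns, keep the key fields, null everything else."""
    without_meta = {k: v for k, v in record.items() if k not in HOODIE_META_COLUMNS}
    return {k: (v if k in KEEP_COLUMNS else None) for k, v in without_meta.items()}

record = {"_hoodie_commit_time": "20220828", "uuid": "r1",
          "partitionpath": "americas", "ts": 1, "rider": "rider-A", "fare": 27.7}
print(soft_delete(record))
# → {'uuid': 'r1', 'partitionpath': 'americas', 'ts': 1, 'rider': None, 'fare': None}
```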



##########
website/docs/quick-start-guide.md:
##########
@@ -958,6 +959,134 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
 
 ## Delete data {#deletes}
 
+Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and just nulling out the values
+for all the other fields; (2) **Hard Deletes**: physically removing any trace of the record from the table.
+See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+
+### Soft Deletes
+
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', }
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+// fetch two records for soft deletes
+val softDeleteDs = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+// prepare the soft deletes by ensuring the appropriate fields are nullified
+val nullifyColumns = softDeleteDs.schema.fields.
+  map(field => (field.name, field.dataType.typeName)).
+  filter(pair => (!HoodieRecord.HOODIE_META_COLUMNS.contains(pair._1)
+    && !Array("ts", "uuid", "partitionpath").contains(pair._1)))
+
+val softDeleteDf = nullifyColumns.
+  foldLeft(softDeleteDs.drop(HoodieRecord.HOODIE_META_COLUMNS: _*))(
+    (ds, col) => ds.withColumn(col._1, lit(null).cast(col._2)))
+
+// simply upsert the table after setting these fields to null
+softDeleteDf.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(OPERATION_OPT_KEY, "upsert").
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+
+// reload data
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+
+// fetch should return total and (total - 2) records for the two queries respectively
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+```
+:::note
+Notice that the save mode is `Append`.
+:::
+</TabItem>
+<TabItem value="python">
+
+```python
+# pyspark
+from pyspark.sql.functions import lit
+from functools import reduce
+
+spark.read.format("hudi"). \
+  load(basePath). \
+  createOrReplaceTempView("hudi_trips_snapshot")
+# fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+# fetch two records for soft deletes
+soft_delete_ds = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+# prepare the soft deletes by ensuring the appropriate fields are nullified
+meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
+  "_hoodie_partition_path", "_hoodie_file_name"]
+excluded_columns = meta_columns + ["ts", "uuid", "partitionpath"]
+nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, \
+  list(map(lambda field: (field.name, field.dataType), soft_delete_ds.schema.fields))))
+
+hudi_soft_delete_options = {
+  'hoodie.table.name': tableName,
+  'hoodie.datasource.write.recordkey.field': 'uuid',
+  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
+  'hoodie.datasource.write.table.name': tableName,
+  'hoodie.datasource.write.operation': 'upsert',
+  'hoodie.datasource.write.precombine.field': 'ts',
+  'hoodie.upsert.shuffle.parallelism': 2, 
+  'hoodie.insert.shuffle.parallelism': 2
+}
+
+soft_delete_df = reduce(lambda df, col: df.withColumn(col[0], lit(None).cast(col[1])), \
+  nullify_columns, reduce(lambda df, col: df.drop(col[0]), meta_columns, soft_delete_ds))
+
+# simply upsert the table after setting these fields to null
+soft_delete_df.write.format("hudi"). \
+  options(**hudi_soft_delete_options). \
+  mode("append"). \
+  save(basePath)
+
+# reload data
+spark.read.format("hudi"). \
+  load(basePath). \
+  createOrReplaceTempView("hudi_trips_snapshot")
+
+# fetch should return total and (total - 2) records for the two queries respectively

Review Comment:
   same here
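The reviewer's point about the two count queries can be illustrated with a toy example (made-up in-memory rows, not Spark SQL): after soft-deleting two records, a plain count still returns the full total, while filtering on a nulled field returns total - 2.

```python
# Ten made-up rows standing in for hudi_trips_snapshot.
rows = [{"uuid": f"r{i}", "rider": f"rider-{i}"} for i in range(10)]

# Soft-delete the first two: keep the key, null the other fields.
for row in rows[:2]:
    row["rider"] = None

total = len(rows)                                        # plain count
not_null = len([r for r in rows if r["rider"] is not None])  # "where rider is not null"
print(total, not_null)
# → 10 8
```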



##########
website/docs/quick-start-guide.md:
##########
@@ -958,6 +959,134 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
 
 ## Delete data {#deletes}
 
+Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and just nulling out the values
+for all the other fields; (2) **Hard Deletes**: physically removing any trace of the record from the table.
+See the [deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+
+### Soft Deletes
+
+<Tabs
+defaultValue="scala"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', }
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+// fetch total records count
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
+spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+// fetch two records for soft deletes
+val softDeleteDs = spark.sql("select * from hudi_trips_snapshot").limit(2)
+
+// prepare the soft deletes by ensuring the appropriate fields are nullified
+val nullifyColumns = softDeleteDs.schema.fields.
+  map(field => (field.name, field.dataType.typeName)).
+  filter(pair => (!HoodieRecord.HOODIE_META_COLUMNS.contains(pair._1)
+    && !Array("ts", "uuid", "partitionpath").contains(pair._1)))
+
+val softDeleteDf = nullifyColumns.
+  foldLeft(softDeleteDs.drop(HoodieRecord.HOODIE_META_COLUMNS: _*))(
+    (ds, col) => ds.withColumn(col._1, lit(null).cast(col._2)))
+
+// simply upsert the table after setting these fields to null
+softDeleteDf.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(OPERATION_OPT_KEY, "upsert").
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+
+// reload data
+spark.
+  read.
+  format("hudi").
+  load(basePath).
+  createOrReplaceTempView("hudi_trips_snapshot")
+
+// fetch should return total and (total - 2) records for the two queries respectively

Review Comment:
   can we also add a note that "Soft deletes will always be persisted in storage and never removed, but all values will be set to null. So for GDPR or other compliance reasons, users should consider doing hard deletes if need be"
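The semantic difference the reviewer calls out can be sketched on a hypothetical in-memory table (plain dicts, not the Hudi API): a soft delete leaves the record in place with its payload nulled, while a hard delete removes it entirely.

```python
# Toy key-to-record table standing in for a Hudi table; data is made up.
table = {
    "r1": {"ts": 1, "rider": "rider-A"},
    "r2": {"ts": 2, "rider": "rider-B"},
}

def soft_delete(table, key):
    # Record survives in storage; every value is nulled.
    table[key] = {field: None for field in table[key]}

def hard_delete(table, key):
    # Record is physically removed, as compliance use cases may require.
    del table[key]

soft_delete(table, "r1")
hard_delete(table, "r2")
print(table)
# → {'r1': {'ts': None, 'rider': None}}
```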



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
