This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 69e8906 updated writing pages with more details on delete and
describing the full write path (#4250)
69e8906 is described below
commit 69e8906193bda0bcf2f443e61ba78d71e4d64e1b
Author: Kyle Weller <[email protected]>
AuthorDate: Wed Dec 8 19:43:02 2021 -0800
updated writing pages with more details on delete and describing the full
write path (#4250)
---
website/docs/write_operations.md | 62 ++++++++++++++++++++++-----
website/docs/writing_data.md | 93 +++++++++++++++++++++++++++++++++++++---
2 files changed, 137 insertions(+), 18 deletions(-)
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index 06176e5..ccdac23 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -5,16 +5,56 @@ toc: true
last_modified_at:
---
-It may be helpful to understand the 3 different write operations provided by
Hudi datasource or the delta streamer tool and how best to leverage them. These
operations
-can be chosen/changed across each commit/deltacommit issued against the table.
+It may be helpful to understand the different write operations of Hudi and how
best to leverage them. These operations
+can be chosen/changed across each commit/deltacommit issued against the table.
See the [How To docs on Writing Data](/docs/writing_data)
+for more examples.
+## Operation Types
+### UPSERT
+This is the default operation where the input records are first tagged as
inserts or updates by looking up the index.
+The records are ultimately written after heuristics are run to determine how
best to pack them on storage to optimize for things like file sizing.
+This operation is recommended for use-cases like database change capture where
the input almost certainly contains updates. The target table will never show
duplicates.
-- **UPSERT** : This is the default operation where the input records are first
tagged as inserts or updates by looking up the index.
- The records are ultimately written after heuristics are run to determine how
best to pack them on storage to optimize for things like file sizing.
- This operation is recommended for use-cases like database change capture
where the input almost certainly contains updates. The target table will never
show duplicates.
-- **INSERT** : This operation is very similar to upsert in terms of
heuristics/file sizing but completely skips the index lookup step. Thus, it can
be a lot faster than upserts
- for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
table can tolerate duplicates, but just
- need the transactional writes/incremental pull/storage management
capabilities of Hudi.
-- **BULK_INSERT** : Both upsert and insert operations keep input records in
memory to speed up storage heuristics computations faster (among other things)
and thus can be cumbersome for
- initial loading/bootstrapping a Hudi table at first. Bulk insert provides
the same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
- of initial load. However, this just does a best-effort job at sizing files
vs guaranteeing file sizes like inserts/upserts do.
+### INSERT
+This operation is very similar to upsert in terms of heuristics/file sizing
but completely skips the index lookup step. Thus, it can be a lot faster than
upserts
+for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
table can tolerate duplicates, but just
+need the transactional writes/incremental pull/storage management capabilities
of Hudi.
+
+### BULK_INSERT
+Both upsert and insert operations keep input records in memory to speed up
storage heuristics computations (among other things), and thus can be
cumbersome when
+initially loading/bootstrapping a Hudi table. Bulk insert provides the
same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
+of initial load. However, this just does a best-effort job at sizing files vs
guaranteeing file sizes like inserts/upserts do.
+
+### DELETE
+Hudi supports two types of deletes on data stored in Hudi tables, by allowing
the user to specify a different record payload implementation.
+- **Soft Deletes** : Retain the record key and just null out the values for
all the other fields.
+ This can be achieved by ensuring the appropriate fields are nullable in the
table schema and simply upserting the table after setting these fields to null.
+- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
+ - Using DataSource, set `OPERATION_OPT_KEY` to `DELETE_OPERATION_OPT_VAL`.
This will remove all the records in the DataSet being submitted.
+ - Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
+ - Using DataSource or DeltaStreamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
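+As a concrete illustration of the soft delete flavor above, the non-key fields can be nulled out before a regular upsert. This is only a hedged sketch: the field names `uuid`, `ts`, and `partitionpath` and the input `toDeleteDf` are assumptions, and the nulled columns must be nullable in the table schema.
+```scala
+import org.apache.spark.sql.functions.lit
+
+// Keep the record key (and any fields needed for precombine/partitioning);
+// null out everything else, then upsert softDeleteDf back as usual.
+val keyCols = Set("uuid", "ts", "partitionpath")
+val softDeleteDf = toDeleteDf.columns.filterNot(keyCols).foldLeft(toDeleteDf) {
+  (df, c) => df.withColumn(c, lit(null).cast(df.schema(c).dataType))
+}
+```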
+
+## Writing path
+The following is an inside look at the Hudi write path and the sequence of
events that occur during a write.
+
+1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+ 1. First, your input records may have duplicate keys within the same batch,
and these duplicates need to be combined or reduced by key.
+2. [Index Lookup](/docs/next/indexing)
+ 1. Next, an index lookup is performed to match the input records against
existing file groups and identify which file group each record belongs to.
+3. [File Sizing](/docs/next/file_sizing)
+ 1. Then, based on the average size of previous commits, Hudi will make a
plan to add enough records to a small file to get it close to the configured
maximum limit.
+4. [Partitioning](/docs/next/file_layouts)
+ 1. We now arrive at partitioning, where we decide which file groups the
updates and inserts will be placed in, or whether new file groups will be created.
+5. Write I/O
+ 1. Now we actually perform the write operation, which is either creating a
new base file, appending to a log file,
+ or versioning an existing base file.
+6. Update [Index](/docs/next/indexing)
+ 1. Now that the write is performed, we will go back and update the index.
+7. Commit
+ 1. Finally, we commit all of these changes atomically. (A [callback
notification](/docs/next/writing_data#commit-notifications) is exposed.)
+8. [Clean](/docs/next/hoodie_cleaner) (if needed)
+ 1. Following the commit, cleaning is invoked if needed.
+9. [Compaction](/docs/next/compaction)
+ 1. If you are using MOR tables, compaction will either run inline or be
scheduled asynchronously.
+10. Archive
+ 1. Lastly, we perform an archival step which moves old
[timeline](/docs/next/timeline) items to an archive folder.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index a049a8a..15fcc4d 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -297,22 +297,101 @@ For more info refer to [Delete support in
Hudi](https://cwiki.apache.org/conflue
- **Soft Deletes** : Retain the record key and just null out the values for
all the other fields.
This can be achieved by ensuring the appropriate fields are nullable in the
table schema and simply upserting the table after setting these fields to null.
-- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
+- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
- 1) Using DataSource, set `OPERATION_OPT_KEY` to
`DELETE_OPERATION_OPT_VAL`. This will remove all the records in the DataSet
being submitted.
+1. Using DataSource, set `OPERATION_OPT_KEY` to `DELETE_OPERATION_OPT_VAL`.
This will remove all the records in the DataSet being submitted.
- 2) Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
+For example, first read in a dataset:
+```scala
+val roViewDF = spark.
+ read.
+ format("org.apache.hudi").
+ load(basePath + "/*/*/*/*")
+roViewDF.createOrReplaceTempView("hudi_ro_table")
+spark.sql("select count(*) from hudi_ro_table").show() // should return 10
(number of records inserted above)
+val riderValue = spark.sql("select distinct rider from hudi_ro_table").show()
+// copy the value displayed to be used in next step
+```
+Now write a query of which records you would like to delete:
+```scala
+val df = spark.sql("select uuid, partitionPath from hudi_ro_table where rider
= 'rider-213'")
+```
+Lastly, execute the deletion of these records:
+```scala
+val deletes = dataGen.generateDeletes(df.collectAsList())
+val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(OPERATION_OPT_KEY,"delete").
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Append).
+save(basePath);
+```
- 3) Using DataSource or DeltaStreamer, add a column named
`_hoodie_is_deleted` to DataSet. The value of this column must be set to `true`
for all the records to be deleted and either `false` or left null for any
records which are to be upserted.
+2. Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
-Example using hard delete method 2, remove all the records from the table that
exist in the DataSet `deleteDF`:
-```java
+This example will remove all the records from the table that exist in the
DataSet `deleteDF`:
+```scala
deleteDF // dataframe containing just records to be deleted
 .write.format("org.apache.hudi")
.option(...) // Add HUDI options like record-key, partition-path and others
as needed for your setup
// specify record_key, partition_key, precombine_fieldkey & usual params
.option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"org.apache.hudi.EmptyHoodieRecordPayload")
-
+```
+
+3. Using DataSource or DeltaStreamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
+
+Let's say the original schema is:
+```json
+{
+ "type":"record",
+ "name":"example_tbl",
+ "fields":[{
+ "name": "uuid",
+ "type": "string"
+ }, {
+ "name": "ts",
+ "type": "string"
+ }, {
+ "name": "partitionPath",
+ "type": "string"
+ }, {
+ "name": "rank",
+ "type": "long"
+ }
+]}
+```
+Make sure you add `_hoodie_is_deleted` column:
+```json
+{
+ "type":"record",
+ "name":"example_tbl",
+ "fields":[{
+ "name": "uuid",
+ "type": "string"
+ }, {
+ "name": "ts",
+ "type": "string"
+ }, {
+ "name": "partitionPath",
+ "type": "string"
+ }, {
+ "name": "rank",
+ "type": "long"
+ }, {
+ "name" : "_hoodie_is_deleted",
+ "type" : "boolean",
+ "default" : false
+ }
+]}
+```
+
+Then any record you want to delete you can mark `_hoodie_is_deleted` as true:
+```json
+{"ts": "0.0", "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045,
"partitionPath": "americas/brazil/sao_paulo", "_hoodie_is_deleted" : true}
```
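Records flagged this way are then written with a regular upsert, and Hudi drops the rows where the flag is true. A sketch, where `flaggedDF`, `tableName`, and `basePath` are assumed to be defined and the option constants come from `DataSourceWriteOptions` as in the examples above:
```scala
// flaggedDF is assumed to hold records matching the schema above, with
// _hoodie_is_deleted = true on rows to delete and false/null elsewhere.
flaggedDF.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionPath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```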
### Concurrency Control