This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 69e8906 updated writing pages with more details on delete and
describing the full write path (#4250)
69e8906 is described below
commit 69e8906193bda0bcf2f443e61ba78d71e4d64e1b
Author: Kyle Weller <[email protected]>
AuthorDate: Wed Dec 8 19:43:02 2021 -0800
updated writing pages with more details on delete and describing the full
write path (#4250)
---
website/docs/write_operations.md | 62 ++++++++++++++++++++++-----
website/docs/writing_data.md | 93 +++++++++++++++++++++++++++++++++++++---
2 files changed, 137 insertions(+), 18 deletions(-)
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index 06176e5..ccdac23 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -5,16 +5,56 @@ toc: true
last_modified_at:
---
-It may be helpful to understand the 3 different write operations provided by
Hudi datasource or the delta streamer tool and how best to leverage them. These
operations
-can be chosen/changed across each commit/deltacommit issued against the table.
+It may be helpful to understand the different write operations of Hudi and how
best to leverage them. These operations
+can be chosen/changed across each commit/deltacommit issued against the table.
See the [How To docs on Writing Data](/docs/writing_data)
+for more examples.
+## Operation Types
+### UPSERT
+This is the default operation where the input records are first tagged as
inserts or updates by looking up the index.
+The records are ultimately written after heuristics are run to determine how
best to pack them on storage to optimize for things like file sizing.
+This operation is recommended for use-cases like database change capture where
the input almost certainly contains updates. The target table will never show
duplicates.
-- **UPSERT** : This is the default operation where the input records are first
tagged as inserts or updates by looking up the index.
- The records are ultimately written after heuristics are run to determine how
best to pack them on storage to optimize for things like file sizing.
- This operation is recommended for use-cases like database change capture
where the input almost certainly contains updates. The target table will never
show duplicates.
-- **INSERT** : This operation is very similar to upsert in terms of
heuristics/file sizing but completely skips the index lookup step. Thus, it can
be a lot faster than upserts
- for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
table can tolerate duplicates, but just
- need the transactional writes/incremental pull/storage management
capabilities of Hudi.
-- **BULK_INSERT** : Both upsert and insert operations keep input records in
memory to speed up storage heuristics computations faster (among other things)
and thus can be cumbersome for
- initial loading/bootstrapping a Hudi table at first. Bulk insert provides
the same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
- of initial load. However, this just does a best-effort job at sizing files
vs guaranteeing file sizes like inserts/upserts do.
+### INSERT
+This operation is very similar to upsert in terms of heuristics/file sizing
but completely skips the index lookup step. Thus, it can be a lot faster than
upserts
+for use-cases like log de-duplication (in conjunction with options to filter
duplicates mentioned below). This is also suitable for use-cases where the
table can tolerate duplicates, but just
+need the transactional writes/incremental pull/storage management capabilities
of Hudi.
+
+### BULK_INSERT
+Both upsert and insert operations keep input records in memory to speed up
storage heuristics computations (among other things), and thus can be
cumbersome when
+initially loading/bootstrapping a Hudi table. Bulk insert provides the
same semantics as insert, while implementing a sort-based data writing
algorithm, which can scale very well for several hundred TBs
+of initial load. However, this just does a best-effort job at sizing files vs
guaranteeing file sizes like inserts/upserts do.
+
+### DELETE
+Hudi supports two types of deletes on data stored in Hudi tables, by allowing
the user to specify a different record payload implementation.
+- **Soft Deletes** : Retain the record key and just null out the values for
all the other fields.
+ This can be achieved by ensuring the appropriate fields are nullable in the
table schema and simply upserting the table after setting these fields to null.
+- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
+ - Using DataSource, set `OPERATION_OPT_KEY` to `DELETE_OPERATION_OPT_VAL`.
This will remove all the records in the DataSet being submitted.
+ - Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
+ - Using DataSource or DeltaStreamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
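+As a concrete illustration of the soft delete flavor above, the non-key fields can be nulled out before a regular upsert. This is only a hedged sketch: the field names `uuid`, `ts`, and `partitionpath` and the input `toDeleteDf` are assumptions, and the nulled columns must be nullable in the table schema.
+```scala
+import org.apache.spark.sql.functions.lit
+
+// Keep the record key (and any fields needed for precombine/partitioning);
+// null out everything else, then upsert softDeleteDf back as usual.
+val keyCols = Set("uuid", "ts", "partitionpath")
+val softDeleteDf = toDeleteDf.columns.filterNot(keyCols).foldLeft(toDeleteDf) {
+  (df, c) => df.withColumn(c, lit(null).cast(df.schema(c).dataType))
+}
+```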
+
+## Writing path
+The following is an inside look at the Hudi write path and the sequence of
events that occur during a write.
+
+1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+ 1. First, your input records may have duplicate keys within the same batch,
and these duplicates need to be combined or reduced by key.
+2. [Index Lookup](/docs/next/indexing)
+ 1. Next, an index lookup is performed to match the input records against
existing file groups and identify which file group each record belongs to.
+3. [File Sizing](/docs/next/file_sizing)
+ 1. Then, based on the average size of previous commits, Hudi will make a
plan to add enough records to a small file to get it close to the configured
maximum limit.
+4. [Partitioning](/docs/next/file_layouts)
+ 1. We now arrive at partitioning, where we decide which file groups the
updates and inserts will be placed in, or whether new file groups will be created.
+5. Write I/O
+ 1. Now we actually perform the write operation, which is either creating a
new base file, appending to a log file,
+ or versioning an existing base file.
+6. Update [Index](/docs/next/indexing)
+ 1. Now that the write is performed, we will go back and update the index.
+7. Commit
+ 1. Finally, we commit all of these changes atomically. (A [callback
notification](/docs/next/writing_data#commit-notifications) is exposed.)
+8. [Clean](/docs/next/hoodie_cleaner) (if needed)
+ 1. Following the commit, cleaning is invoked if needed.
+9. [Compaction](/docs/next/compaction)
+ 1. If you are using MOR tables, compaction will either run inline or be
scheduled asynchronously.
+10. Archive
+ 1. Lastly, we perform an archival step which moves old
[timeline](/docs/next/timeline) items to an archive folder.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index a049a8a..15fcc4d 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -297,22 +297,101 @@ For more info refer to [Delete support in
Hudi](https://cwiki.apache.org/conflue
- **Soft Deletes** : Retain the record key and just null out the values for
all the other fields.
This can be achieved by ensuring the appropriate fields are nullable in the
table schema and simply upserting the table after setting these fields to null.
-- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
+- **Hard Deletes** : A stronger form of deletion is to physically remove any
trace of the record from the table. This can be achieved in 3 different ways.
- 1) Using DataSource, set `OPERATION_OPT_KEY` to
`DELETE_OPERATION_OPT_VAL`. This will remove all the records in the DataSet
being submitted.
+1. Using DataSource, set `OPERATION_OPT_KEY` to `DELETE_OPERATION_OPT_VAL`.
This will remove all the records in the DataSet being submitted.
- 2) Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
+For example, first read in a dataset:
+```scala
+val roViewDF = spark.
+ read.
+ format("org.apache.hudi").
+ load(basePath + "/*/*/*/*")
+roViewDF.createOrReplaceTempView("hudi_ro_table")
+spark.sql("select count(*) from hudi_ro_table").show() // should return 10
(number of records inserted above)
+val riderValue = spark.sql("select distinct rider from hudi_ro_table").show()
+// copy the value displayed to be used in next step
+```
+Now write a query of which records you would like to delete:
+```scala
+val df = spark.sql("select uuid, partitionPath from hudi_ro_table where rider
= 'rider-213'")
+```
+Lastly, execute the deletion of these records:
+```scala
+val deletes = dataGen.generateDeletes(df.collectAsList())
+val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(OPERATION_OPT_KEY,"delete").
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Append).
+save(basePath);
+```
- 3) Using DataSource or DeltaStreamer, add a column named
`_hoodie_is_deleted` to DataSet. The value of this column must be set to `true`
for all the records to be deleted and either `false` or left null for any
records which are to be upserted.
+2. Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
-Example using hard delete method 2, remove all the records from the table that
exist in the DataSet `deleteDF`:
-```java
+This example will remove all the records from the table that exist in the
DataSet `deleteDF`:
+```scala
deleteDF // dataframe containing just records to be deleted
 .write.format("org.apache.hudi")
.option(...) // Add HUDI options like record-key, partition-path and others
as needed for your setup
// specify record_key, partition_key, precombine_fieldkey & usual params
.option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
"org.apache.hudi.EmptyHoodieRecordPayload")
-
+```
+
+3. Using DataSource or DeltaStreamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
+
+Let's say the original schema is:
+```json
+{
+ "type":"record",
+ "name":"example_tbl",
+ "fields":[{
+ "name": "uuid",
+ "type": "string"
+ }, {
+ "name": "ts",
+ "type": "string"
+ }, {
+ "name": "partitionPath",
+ "type": "string"
+ }, {
+ "name": "rank",
+ "type": "long"
+ }
+]}
+```
+Make sure you add `_hoodie_is_deleted` column:
+```json
+{
+ "type":"record",
+ "name":"example_tbl",
+ "fields":[{
+ "name": "uuid",
+ "type": "string"
+ }, {
+ "name": "ts",
+ "type": "string"
+ }, {
+ "name": "partitionPath",
+ "type": "string"
+ }, {
+ "name": "rank",
+ "type": "long"
+ }, {
+ "name" : "_hoodie_is_deleted",
+ "type" : "boolean",
+ "default" : false
+ }
+]}
+```
+
+Then any record you want to delete you can mark `_hoodie_is_deleted` as true:
+```json
+{"ts": "0.0", "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045,
"partitionPath": "americas/brazil/sao_paulo", "_hoodie_is_deleted" : true}
```
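Records flagged this way are then written with a regular upsert, and Hudi drops the rows where the flag is true. A sketch, where `flaggedDF`, `tableName`, and `basePath` are assumed to be defined and the option constants come from `DataSourceWriteOptions` as in the examples above:
```scala
// flaggedDF is assumed to hold records matching the schema above, with
// _hoodie_is_deleted = true on rows to delete and false/null elsewhere.
flaggedDF.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionPath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```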
### Concurrency Control