This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 30bca65 Update writing_data for operations/deletes (#794)
30bca65 is described below
commit 30bca65de31b5f8ba7d724853ea03348cc99c7fb
Author: vinoth chandar <[email protected]>
AuthorDate: Wed Jul 17 06:01:48 2019 -0700
Update writing_data for operations/deletes (#794)
- provided guidance for upsert vs insert vs bulk_insert
- provided guidance for soft deletes vs hard deletes
---
docs/writing_data.md | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/docs/writing_data.md b/docs/writing_data.md
index 9f5eb2b..4092c69 100644
--- a/docs/writing_data.md
+++ b/docs/writing_data.md
@@ -10,6 +10,24 @@ summary: In this page, we will discuss some available tools for incrementally in
In this section, we will cover ways to ingest new changes from external sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) tool, as well as
speeding up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such datasets can then be [queried](querying_data.html) using various query engines.
+
+## Write Operations
+
+Before diving into these tools, it may be helpful to understand the three different write operations provided by the Hudi datasource and the delta streamer tool, and how best to leverage them. These operations
+can be chosen/changed for each commit/deltacommit issued against the dataset; an example of picking an operation follows the list below.
+
+
+ - **UPSERT** : This is the default operation, where the input records are first tagged as inserts or updates by looking up the index, and
+ then written out after running heuristics to determine how best to pack them on storage, optimizing for things like file sizing.
+ This operation is recommended for use-cases like database change capture, where the input almost certainly contains updates.
+ - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing, but completely skips the index lookup step. Thus, it can be a lot faster than upsert
+ for use-cases like log de-duplication (in conjunction with the options to filter duplicates mentioned below). It is also suitable for use-cases where the dataset can tolerate duplicates, but just
+ needs the transactional writes/incremental pull/storage management capabilities of Hudi.
+ - **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations (among other things), which can be cumbersome when
+ initially loading/bootstrapping a Hudi dataset. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm that can scale very well to several hundred TBs
+ of initial load. However, it only makes a best-effort attempt at sizing files, rather than guaranteeing file sizes the way insert/upsert do.
+
+
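+For example, a minimal sketch of picking the operation via the datasource is shown below (assuming `inputDF` holds the input records and `basePath` points at the dataset; the option constants follow the `com.uber.hoodie.DataSourceWriteOptions` class and are worth verifying against the Hudi version in use):
+
+ ```scala
+ import com.uber.hoodie.DataSourceWriteOptions
+ import org.apache.spark.sql.SaveMode
+
+ inputDF.write
+   .format("com.uber.hoodie")
+   // one of upsert (default), insert or bulk_insert; bulk insert shown here for an initial load
+   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
+   // record key, partition path, precombine field & other usual options also apply
+   .mode(SaveMode.Append)
+   .save(basePath)
+ ```
+
+With the delta streamer tool, the same choice is made via its `--op` flag.
+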
## DeltaStreamer
The `HoodieDeltaStreamer` utility (part of hoodie-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
@@ -164,6 +182,23 @@ Usage: <main class> [options]
Hive username
```
+## Deletes
+
+Hudi supports two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation.
+
+ - **Soft Deletes** : With soft deletes, the user wants to retain the key but null out the values of all other fields.
+ This can be achieved by ensuring the appropriate fields are nullable in the dataset schema, and upserting the dataset after setting these fields to null; a short sketch follows the hard delete example below.
+ - **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the dataset. This can be achieved by issuing an upsert with a custom payload implementation,
+ via either the datasource or the delta streamer, that always returns `Optional.empty()` as the combined value. Hudi ships with a built-in `com.uber.hoodie.EmptyHoodieRecordPayload` class that does exactly this.
+
+ ```scala
+ // writer created from a dataframe containing just the records to be deleted;
+ // the record key, partition path, precombine field & usual params are specified as well
+ writer
+   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "com.uber.hoodie.EmptyHoodieRecordPayload")
+   .mode(SaveMode.Append)
+   .save(basePath) // basePath points at the dataset location
+ ```
+
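+For soft deletes, a minimal sketch could look like the following (assuming `toDeleteDF` holds the records to soft-delete, and `field1`/`field2` stand in for the non-key fields of a hypothetical schema):
+
+ ```scala
+ import org.apache.spark.sql.functions.lit
+
+ // retain the keys, null out every other (nullable) field, then upsert the result as usual
+ val softDeleteDF = toDeleteDF
+   .withColumn("field1", lit(null).cast("string"))
+   .withColumn("field2", lit(null).cast("long"))
+ ```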
+
## Storage Management
Hudi also performs several key storage management functions on the data stored in a Hudi dataset. A key aspect of storing data on DFS is managing file sizes and counts