This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 63b0f757a0 [DOCS] Adding FAQ on why hudi has record key requirement (#5839)
63b0f757a0 is described below
commit 63b0f757a00c7b4669fbfc5a5c5a123d7b9f2942
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Fri Jun 10 18:54:21 2022 -0400
[DOCS] Adding FAQ on why hudi has record key requirement (#5839)
---
website/docs/faq.md | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/website/docs/faq.md b/website/docs/faq.md
index be340e3b38..52137d1be5 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -67,6 +67,17 @@ When writing data into Hudi, you model the records like how you would on a key-value store.
When querying/reading data, Hudi just presents itself as a json-like hierarchical table that everyone is used to querying using Hive/Spark/Presto over Parquet/Json/Avro.
+### Why does Hudi require a key field to be configured?
+Hudi was designed to support fast record-level upserts and thus requires a key to identify whether an incoming record is an insert, update, or delete, and to process it accordingly.
+Additionally, Hudi automatically maintains an index on this primary key, and for many use-cases like CDC, enforcing such primary key constraints is crucial to ensure data quality.
+In this context, the pre-combine key helps reconcile multiple records with the same key within a single batch of input records.
+Even for append-only data streams, Hudi supports key-based de-duplication before inserting records.
+For example, at-least-once data integration systems like Kafka MirrorMaker can introduce duplicates during failures.
+Even for plain old batch pipelines, keys help eliminate duplication that could be caused by backfill pipelines, where it is commonly unclear which set of records needs to be re-written.
+We are actively working on making keys easier to use by requiring them only for upserts and/or generating the key automatically (much like RDBMS row_ids).
+
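To make the above concrete, the record key and pre-combine key are supplied as write options on the Hudi datasource. A minimal sketch follows; the table name, storage path, and the field names `uuid` and `ts` are hypothetical, and the commented-out write call assumes a SparkSession with the Hudi bundle on its classpath:

```python
# Illustrative Hudi write options (table name and field names are hypothetical).
hudi_options = {
    "hoodie.table.name": "trips",                       # target table name
    "hoodie.datasource.write.recordkey.field": "uuid",  # primary key: identifies insert vs update vs delete
    "hoodie.datasource.write.precombine.field": "ts",   # reconciles duplicate keys within one input batch
    "hoodie.datasource.write.operation": "upsert",      # resolve each incoming record against the key index
}

# With a Spark DataFrame `df` and a SparkSession carrying the Hudi bundle:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/trips")
```

With these options, two records sharing the same `uuid` in one batch are reduced to the one with the latest `ts` before being written, which is how the pre-combine key performs the de-duplication described above.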
### Does Hudi support cloud storage/object stores?
Yes. Generally speaking, Hudi is able to provide its functionality on any Hadoop FileSystem implementation and thus can read and write datasets on [Cloud stores](https://hudi.apache.org/docs/cloud) (Amazon S3 or Microsoft Azure or Google Cloud Storage). Over time, Hudi has also incorporated specific design aspects that make building Hudi datasets on the cloud easy, such as [consistency checks for s3](https://hudi.apache.org/docs/configurations#hoodieconsistencycheckenabled), Zero moves/r [...]