This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ca1d36b71d90 chore(site): update FAQ ref links and ks3 fs (#14124)
ca1d36b71d90 is described below
commit ca1d36b71d905deb35570c6ce57f85498501ef85
Author: Shiyan Xu <[email protected]>
AuthorDate: Tue Oct 21 21:27:46 2025 -0500
chore(site): update FAQ ref links and ks3 fs (#14124)
---
website/docs/cloud.md | 4 +++-
docs/_docs/3_3_ks3_filesystem.md => website/docs/ks3_hoodie.md | 3 +--
website/docs/quick-start-guide.md | 2 +-
website/docs/record_merger.md | 2 +-
website/docs/writing_data.md | 4 ++--
website/docusaurus.config.js | 4 ++++
website/sidebars.js | 3 ++-
website/src/pages/faq/design_and_concepts.md | 4 ++--
website/src/pages/faq/general.md | 2 +-
website/src/pages/faq/storage.md | 6 +++---
website/src/pages/faq/table_services.md | 6 +++---
website/src/pages/faq/writing_tables.md | 10 +++++-----
12 files changed, 28 insertions(+), 22 deletions(-)
diff --git a/website/docs/cloud.md b/website/docs/cloud.md
index 123abd5e6bea..a6e0068bb58f 100644
--- a/website/docs/cloud.md
+++ b/website/docs/cloud.md
@@ -29,10 +29,12 @@ to cloud stores.
Configurations required for JuiceFS and Hudi co-operability.
* [Oracle Cloud Infrastructure](oci_hoodie) <br/>
Configurations required for OCI and Hudi co-operability.
+* [KS3 File System](ks3_hoodie) <br/>
+ Configurations required for KS3 FS and Hudi co-operability.
:::note
Many cloud object storage systems like [Amazon
S3](https://docs.aws.amazon.com/s3/) allow you to set
lifecycle policies, such as [S3
Lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html),
to manage objects. One of the policies is related to object expiration. If
your organisation has configured such policies,
then please ensure to exclude (or have a longer expiry period) for Hudi tables.
-:::
\ No newline at end of file
+:::
diff --git a/docs/_docs/3_3_ks3_filesystem.md b/website/docs/ks3_hoodie.md
similarity index 93%
rename from docs/_docs/3_3_ks3_filesystem.md
rename to website/docs/ks3_hoodie.md
index 1445636858fb..c9d6bc9c59ef 100644
--- a/docs/_docs/3_3_ks3_filesystem.md
+++ b/website/docs/ks3_hoodie.md
@@ -1,7 +1,6 @@
---
title: KS3 Filesystem
-keywords: hudi, hive, aws, s3, spark, presto, ks3
-permalink: /docs/ks3_hoodie.html
+keywords: [hudi, hive, aws, s3, spark, presto, ks3]
summary: In this page, we go over how to configure Hudi with KS3 filesystem.
last_modified_at: 2021-08-09T15:59:57-04:00
---
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 388482281f6a..9a91fa3e17fa 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -1305,7 +1305,7 @@ transformation support, automatic table services and so on.
**Structured Streaming** - Hudi supports Spark Structured Streaming reads and
writes as well. Please see
[here](writing_tables_streaming_writes#spark-streaming) for more.
-Check out more information on [modeling data in Hudi](faq/general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes).
+Check out more information on [modeling data in Hudi](/faq/general#how-do-i-model-the-data-stored-in-hudi) and different ways to perform [batch writes](/docs/writing_data) and [streaming writes](writing_tables_streaming_writes).
### Dockerized Demo
Even as we showcased the core capabilities, Hudi supports a lot more advanced
functionality that can make it easy
diff --git a/website/docs/record_merger.md b/website/docs/record_merger.md
index 4c47dbde1b6a..773ee8995f15 100644
--- a/website/docs/record_merger.md
+++ b/website/docs/record_merger.md
@@ -249,7 +249,7 @@ Payload class can be specified using the below configs. For more advanced config
There are also quite a few other implementations. Developers may be interested
in looking at the hierarchy of `HoodieRecordPayload` interface. For
example,
[`MySqlDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java)
and
[`PostgresDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/PostgresDebeziumAvroPayload.java)
provides support for seamlessly applying changes
captured via Debezium for MySQL and PostgresDB.
[`AWSDmsAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java)
provides support for applying changes captured via Amazon Database Migration
Service onto S3.
-For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](faq/writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads.
+For full configurations, go [here](/docs/configurations#RECORD_PAYLOAD) and please check out [this FAQ](/faq/writing_tables/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage) if you want to implement your own custom payloads.
## Related Resources
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index 6d3272378e55..2e193f80359c 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -83,7 +83,7 @@ df.write.format("hudi").
You can check the data generated under
`/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
(`uuid` in
[schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)),
partition field (`region/country/city`) and combine logic (`ts` in
[schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60))
to ensure trip records are unique within each partition. For more info, refer
to
-[Modeling data stored in Hudi](faq/general/#how-do-i-model-the-data-stored-in-hudi)
+[Modeling data stored in Hudi](/faq/general/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi
Tables](/docs/hoodie_streaming_ingestion).
Here we are using the default write operation : `upsert`. If you have a
workload without updates, you can also issue
`insert` or `bulk_insert` operations which could be faster. To know more,
refer to [Write operations](/docs/write_operations)
@@ -119,7 +119,7 @@ df.write.format("hudi").
You can check the data generated under
`/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
(`uuid` in
[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)),
partition field (`region/country/city`) and combine logic (`ts` in
[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60))
to ensure trip records are unique within each partition. For more info, refer
to
-[Modeling data stored in Hudi](faq_general/#how-do-i-model-the-data-stored-in-hudi)
+[Modeling data stored in Hudi](/faq/general/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi
Tables](/docs/hoodie_streaming_ingestion).
Here we are using the default write operation : `upsert`. If you have a
workload without updates, you can also issue
`insert` or `bulk_insert` operations which could be faster. To know more,
refer to [Write operations](/docs/write_operations)
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index ee85581c0d01..be3e4c01ea77 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -402,6 +402,10 @@ module.exports = {
label: "IBM Cloud",
to: "/docs/ibm_cos_hoodie",
},
+ {
+ label: "Oracle Cloud",
+ to: "/docs/oci_hoodie",
+ },
],
},
{
diff --git a/website/sidebars.js b/website/sidebars.js
index 71c209ebdcaa..233e7ff33696 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -135,7 +135,8 @@ module.exports = {
'ibm_cos_hoodie',
'bos_hoodie',
'jfs_hoodie',
- 'oci_hoodie'
+ 'oci_hoodie',
+ 'ks3_hoodie',
],
},
],
diff --git a/website/src/pages/faq/design_and_concepts.md b/website/src/pages/faq/design_and_concepts.md
index 6da73cefb728..faa062c9c95c 100644
--- a/website/src/pages/faq/design_and_concepts.md
+++ b/website/src/pages/faq/design_and_concepts.md
@@ -7,7 +7,7 @@ keywords: [hudi, writing, reading]
### How does Hudi ensure atomicity?
-Hudi writers atomically move an inflight write operation to a "completed" state by writing an object/file to the [timeline](timeline) folder, identifying the write operation with an instant time that denotes the time the action is deemed to have occurred. This is achieved on the underlying DFS (in the case of S3/Cloud Storage, by an atomic PUT operation) and can be observed by files of the pattern `<instant>.<action>.<state>` in Hudi’s timeline.
+Hudi writers atomically move an inflight write operation to a "completed" state by writing an object/file to the [timeline](/docs/timeline) folder, identifying the write operation with an instant time that denotes the time the action is deemed to have occurred. This is achieved on the underlying DFS (in the case of S3/Cloud Storage, by an atomic PUT operation) and can be observed by files of the pattern `<instant>.<action>.<state>` in Hudi’s timeline.
### Does Hudi extend the Hive table layout?
@@ -49,7 +49,7 @@ To expand more on the long term approach, Hudi has had a proposal to streamline/
This has been delayed for a few reasons
- Large hosted query engines and users not upgrading fast enough.
-- The issues brought up - \[[1](faq/design_and_concepts#does-hudis-use-of-wall-clock-timestamp-for-instants-pose-any-clock-skew-issues),[2](faq/design_and_concepts#hudis-commits-are-based-on-transaction-start-time-instead-of-completed-time-does-this-cause-data-loss-or-inconsistency-in-case-of-incremental-and-time-travel-queries)\],
+- The issues brought up - \[[1](/faq/design_and_concepts#does-hudis-use-of-wall-clock-timestamp-for-instants-pose-any-clock-skew-issues),[2](/faq/design_and_concepts#hudis-commits-are-based-on-transaction-start-time-instead-of-completed-time-does-this-cause-data-loss-or-inconsistency-in-case-of-incremental-and-time-travel-queries)\],
relevant to this are not practically very important to users beyond good
pedantic discussions,
- Wanting to do it alongside [non-blocking concurrency
control](https://github.com/apache/hudi/pull/7907) in Hudi version 1.x.
diff --git a/website/src/pages/faq/general.md b/website/src/pages/faq/general.md
index 80005a23125b..c55addbad68f 100644
--- a/website/src/pages/faq/general.md
+++ b/website/src/pages/faq/general.md
@@ -62,7 +62,7 @@ Nonetheless, Hudi is designed very much like a database and provides similar fun
### How do I model the data stored in Hudi?
-When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](writing_data) for an example.
+When writing data into Hudi, you model the records like how you would on a key-value store - specify a key field (unique for a single partition/across table), a partition field (denotes partition to place key into) and preCombine/combine logic that specifies how to handle duplicates in a batch of records written. This model enables Hudi to enforce primary key constraints like you would get on a database table. See [here](/docs/writing_data) for an example.
When querying/reading data, Hudi just presents itself as a json-like
hierarchical table, everyone is used to querying using Hive/Spark/Presto over
Parquet/Json/Avro.
diff --git a/website/src/pages/faq/storage.md b/website/src/pages/faq/storage.md
index 77119450539e..a266405a489e 100644
--- a/website/src/pages/faq/storage.md
+++ b/website/src/pages/faq/storage.md
@@ -19,7 +19,7 @@ More details can be found [here](/docs/concepts/) and also [Design And Architect
### How do I migrate my data to Hudi?
-Hudi provides built in support for rewriting your entire table into Hudi one-time using the HDFSParquetImporter tool available from the hudi-cli . You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using normal means discussed [here](faq/writing_tables#what-are-some-ways-to-write-a-hudi-table). This topic is discussed in detail [here](/docs/migration_guide/), including ways to doing partial migrations.
+Hudi provides built in support for rewriting your entire table into Hudi one-time using the HDFSParquetImporter tool available from the hudi-cli . You could also do this via a simple read and write of the dataset using the Spark datasource APIs. Once migrated, writes can be performed using normal means discussed [here](/faq/writing_tables#what-are-some-ways-to-write-a-hudi-table). This topic is discussed in detail [here](/docs/migration_guide/), including ways to doing partial migrations.
### How to convert an existing COW table to MOR?
@@ -170,13 +170,13 @@ After first write:
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | Url | ts | uuid |
| ---| ---| ---| ---| ---| ---| ---| --- |
-| 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... | [hudi.apache.com](http://hudi.apache.com) | 1 | 1 |
+| 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... | hudi.apache.org | 1 | 1 |
After the second write:
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | Url | ts | uuid |
| ---| ---| ---| ---| ---| ---| ---| --- |
-| 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... | [hudi.apache.com](http://hudi.apache.com) | 1 | 1 |
+| 20220622204044318 | 20220622204044318... | 1 | | 890aafc0-d897-44d... | hudi.apache.org | 1 | 1 |
| 20220622204208997 | 20220622204208997... | 2 | | 890aafc0-d897-44d... | null | 1 | 2 |
### Can I change keygenerator for an existing table?
diff --git a/website/src/pages/faq/table_services.md b/website/src/pages/faq/table_services.md
index 0c6db085f740..60d870fce3b2 100644
--- a/website/src/pages/faq/table_services.md
+++ b/website/src/pages/faq/table_services.md
@@ -36,8 +36,8 @@ Depending on how you write to Hudi these are the possible options currently.
* Please note it is not possible to disable async compaction for MOR table
with spark structured streaming.
* Flink:
* Async compaction is enabled by default for Merge-On-Read table.
- * Offline compaction can be achieved by setting `compaction.async.enabled` to `false` and periodically running [Flink offline Compactor](compaction/#flink-offline-compaction). When running the offline compactor, one needs to ensure there are no active writes to the table.
- * Third option (highly recommended over the second one) is to schedule the compactions from the regular ingestion job and executing the compaction plans from an offline job. To achieve this set `compaction.async.enabled` to `false`, `compaction.schedule.enabled` to `true` and then run the [Flink offline Compactor](compaction/#flink-offline-compaction) periodically to execute the plans.
+ * Offline compaction can be achieved by setting `compaction.async.enabled` to `false` and periodically running [Flink offline Compactor](/docs/compaction/#flink-offline-compaction). When running the offline compactor, one needs to ensure there are no active writes to the table.
+ * Third option (highly recommended over the second one) is to schedule the compactions from the regular ingestion job and executing the compaction plans from an offline job. To achieve this set `compaction.async.enabled` to `false`, `compaction.schedule.enabled` to `true` and then run the [Flink offline Compactor](/docs/compaction/#flink-offline-compaction) periodically to execute the plans.
### How to disable all table services in case of multiple writers?
@@ -51,6 +51,6 @@ Hudi runs cleaner to remove old file versions as part of writing data either in
Yes. Hudi provides the ability to post a callback notification about a write
commit. You can use a http hook or choose to
-be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](platform_services_post_commit_callback)
+be notified via a Kafka/pulsar topic or plug in your own implementation to get notified. Please refer [here](/docs/platform_services_post_commit_callback)
for details
diff --git a/website/src/pages/faq/writing_tables.md b/website/src/pages/faq/writing_tables.md
index c2c30abeb807..534ba34eb24f 100644
--- a/website/src/pages/faq/writing_tables.md
+++ b/website/src/pages/faq/writing_tables.md
@@ -7,7 +7,7 @@ keywords: [hudi, writing, reading]
### What are some ways to write a Hudi table?
-Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasour [...]
+Typically, you obtain a set of partial updates/inserts from your source and issue [write operations](/docs/write_operations/) against a Hudi table. If you ingesting data from any of the standard sources like Kafka, or tailing DFS, the [delta streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) tool is invaluable and provides an easy, self-managed solution to getting data written into Hudi. You can also write your own code to capture data from a custom source using the Spark datasour [...]
### How is a Hudi writer job deployed?
@@ -69,15 +69,15 @@ As you could see, ([combineAndGetUpdateValue(), getInsertValue()](https://github
### How do I delete records in the dataset using Hudi?
-GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](writing_data#deletes).
+GDPR has made deletes a must-have tool in everyone's data management toolbox. Hudi supports both soft and hard deletes. For details on how to actually perform them, see [here](/docs/writing_data#deletes).
### Should I need to worry about deleting all copies of the records in case of duplicates?
-No. Hudi removes all the copies of a record key when deletes are issued. Here is the long form explanation - Sometimes accidental user errors can lead to duplicates introduced into a Hudi table by either [concurrent inserts](faq/writing_tables#can-concurrent-inserts-cause-duplicates) or by [not deduping the input records](faq/writing_tables#can-single-writer-inserts-have-duplicates) for an insert operation. However, using the right index (e.g., in the default [Simple Index](https://githu [...]
+No. Hudi removes all the copies of a record key when deletes are issued. Here is the long form explanation - Sometimes accidental user errors can lead to duplicates introduced into a Hudi table by either [concurrent inserts](/faq/writing_tables#can-concurrent-inserts-cause-duplicates) or by [not deduping the input records](/faq/writing_tables#can-single-writer-inserts-have-duplicates) for an insert operation. However, using the right index (e.g., in the default [Simple Index](https://git [...]
### How does Hudi handle duplicate record keys in an input?
-When issuing an `upsert` operation on a table and the batch of records provided contains multiple entries for a given key, then all of them are reduced into a single final value by repeatedly calling payload class's [preCombine()](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L40) method . By default, we pick the record with the greatest value (determined by calling .compareTo() [...]
+When issuing an `upsert` operation on a table and the batch of records provided contains multiple entries for a given key, then all of them are reduced into a single final value by repeatedly calling payload class's [preCombine()](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L40) method . By default, we pick the record with the greatest value (determined by calling .compareTo() [...]
For an insert or bulk_insert operation, no such pre-combining is performed.
Thus, if your input contains duplicates, the table would also contain
duplicates. If you don't want duplicate records either issue an **upsert** or
consider specifying option to de-duplicate input in either datasource using
[`hoodie.datasource.write.insert.drop.duplicates`](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates)
& [`hoodie.combine.before.insert`](/docs/configurations/#hoodiecombinebeforei
[...]
@@ -184,7 +184,7 @@ No, Hudi does not expose uncommitted files/blocks to the readers. Further, Hudi
### How are conflicts detected in Hudi between multiple writers?
-Hudi employs [optimistic concurrency control](concurrency_control) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/cli [...]
+Hudi employs [optimistic concurrency control](/docs/concurrency_control) between writers, while implementing MVCC based concurrency control between writers and the table services. Concurrent writers to the same table need to be configured with the same lock provider configuration, to safely perform writes. By default (implemented in “[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hu [...]
### Can single-writer inserts have duplicates?