This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new cdfb881e643f docs: fix broken links in Hudi website since 0.14.0 (#14192)
cdfb881e643f is described below
commit cdfb881e643f14ba0168efa58640bea10d04c3fe
Author: deepakpanda93 <[email protected]>
AuthorDate: Thu Nov 6 00:58:29 2025 +0530
docs: fix broken links in Hudi website since 0.14.0 (#14192)
---
website/docs/comparison.md | 2 +-
website/docs/configurations.md | 2 +-
website/docs/hudi_stack.md | 2 +-
website/docs/metadata.md | 2 +-
website/docs/overview.mdx | 2 +-
website/docs/structure.md | 2 +-
website/docs/syncing_datahub.md | 2 +-
website/docs/troubleshooting.md | 2 +-
website/docs/tuning-guide.md | 2 +-
.../versioned_docs/version-0.14.0/compaction.md | 2 +-
.../versioned_docs/version-0.14.0/comparison.md | 2 +-
.../version-0.14.0/configurations.md | 2 +-
website/versioned_docs/version-0.14.0/faq.md | 2 +-
website/versioned_docs/version-0.14.0/metadata.md | 2 +-
website/versioned_docs/version-0.14.0/overview.mdx | 2 +-
website/versioned_docs/version-0.14.0/s3_hoodie.md | 2 +-
.../version-0.14.0/schema_evolution.md | 2 +-
.../versioned_docs/version-0.14.0/sql_queries.md | 2 +-
website/versioned_docs/version-0.14.0/structure.md | 2 +-
.../version-0.14.0/syncing_datahub.md | 2 +-
.../version-0.14.0/troubleshooting.md | 2 +-
.../versioned_docs/version-0.14.0/tuning-guide.md | 2 +-
website/versioned_docs/version-0.14.0/use_cases.md | 2 +-
.../versioned_docs/version-0.14.1/compaction.md | 2 +-
.../versioned_docs/version-0.14.1/comparison.md | 2 +-
.../version-0.14.1/configurations.md | 2 +-
.../versioned_docs/version-0.14.1/faq_storage.md | 2 +-
website/versioned_docs/version-0.14.1/metadata.md | 2 +-
website/versioned_docs/version-0.14.1/overview.mdx | 2 +-
website/versioned_docs/version-0.14.1/s3_hoodie.md | 2 +-
.../versioned_docs/version-0.14.1/sql_queries.md | 2 +-
website/versioned_docs/version-0.14.1/structure.md | 2 +-
.../version-0.14.1/syncing_datahub.md | 2 +-
.../versioned_docs/version-0.14.1/table_types.md | 2 +-
.../version-0.14.1/troubleshooting.md | 2 +-
.../versioned_docs/version-0.14.1/tuning-guide.md | 2 +-
website/versioned_docs/version-0.14.1/use_cases.md | 2 +-
.../versioned_docs/version-0.15.0/compaction.md | 2 +-
.../versioned_docs/version-0.15.0/comparison.md | 2 +-
.../version-0.15.0/configurations.md | 2 +-
.../versioned_docs/version-0.15.0/faq_storage.md | 2 +-
website/versioned_docs/version-0.15.0/metadata.md | 2 +-
website/versioned_docs/version-0.15.0/overview.mdx | 2 +-
.../version-0.15.0/reading_tables_batch_reads.md | 2 +-
website/versioned_docs/version-0.15.0/s3_hoodie.md | 2 +-
.../versioned_docs/version-0.15.0/sql_queries.md | 2 +-
website/versioned_docs/version-0.15.0/structure.md | 2 +-
.../version-0.15.0/syncing_datahub.md | 2 +-
.../versioned_docs/version-0.15.0/table_types.md | 2 +-
.../version-0.15.0/troubleshooting.md | 2 +-
.../versioned_docs/version-0.15.0/tuning-guide.md | 2 +-
website/versioned_docs/version-0.15.0/use_cases.md | 2 +-
website/versioned_docs/version-1.0.0/compaction.md | 2 +-
website/versioned_docs/version-1.0.0/comparison.md | 2 +-
.../versioned_docs/version-1.0.0/configurations.md | 2 +-
.../versioned_docs/version-1.0.0/faq_storage.md | 2 +-
website/versioned_docs/version-1.0.0/hudi_stack.md | 2 +-
website/versioned_docs/version-1.0.0/metadata.md | 2 +-
website/versioned_docs/version-1.0.0/overview.mdx | 2 +-
.../version-1.0.0/reading_tables_batch_reads.md | 2 +-
website/versioned_docs/version-1.0.0/s3_hoodie.md | 2 +-
.../versioned_docs/version-1.0.0/sql_queries.md | 2 +-
website/versioned_docs/version-1.0.0/structure.md | 2 +-
.../version-1.0.0/syncing_datahub.md | 2 +-
.../versioned_docs/version-1.0.0/table_types.md | 2 +-
.../version-1.0.0/troubleshooting.md | 2 +-
.../versioned_docs/version-1.0.0/tuning-guide.md | 2 +-
website/versioned_docs/version-1.0.1/compaction.md | 2 +-
website/versioned_docs/version-1.0.1/comparison.md | 4 ++--
.../versioned_docs/version-1.0.1/configurations.md | 2 +-
.../versioned_docs/version-1.0.1/faq_storage.md | 2 +-
website/versioned_docs/version-1.0.1/hudi_stack.md | 2 +-
website/versioned_docs/version-1.0.1/metadata.md | 2 +-
website/versioned_docs/version-1.0.1/overview.mdx | 2 +-
.../version-1.0.1/reading_tables_batch_reads.md | 2 +-
website/versioned_docs/version-1.0.1/s3_hoodie.md | 2 +-
.../versioned_docs/version-1.0.1/sql_queries.md | 2 +-
website/versioned_docs/version-1.0.1/structure.md | 2 +-
.../version-1.0.1/syncing_datahub.md | 2 +-
.../version-1.0.1/troubleshooting.md | 2 +-
.../versioned_docs/version-1.0.1/tuning-guide.md | 2 +-
website/versioned_docs/version-1.0.2/compaction.md | 2 +-
website/versioned_docs/version-1.0.2/comparison.md | 2 +-
.../versioned_docs/version-1.0.2/configurations.md | 2 +-
.../versioned_docs/version-1.0.2/faq_storage.md | 2 +-
website/versioned_docs/version-1.0.2/hudi_stack.md | 26 +++++++++++-----------
website/versioned_docs/version-1.0.2/metadata.md | 2 +-
website/versioned_docs/version-1.0.2/overview.mdx | 2 +-
website/versioned_docs/version-1.0.2/s3_hoodie.md | 2 +-
.../versioned_docs/version-1.0.2/sql_queries.md | 2 +-
website/versioned_docs/version-1.0.2/structure.md | 2 +-
.../version-1.0.2/syncing_datahub.md | 2 +-
.../versioned_docs/version-1.0.2/table_types.md | 2 +-
.../version-1.0.2/troubleshooting.md | 2 +-
.../versioned_docs/version-1.0.2/tuning-guide.md | 2 +-
95 files changed, 108 insertions(+), 108 deletions(-)
diff --git a/website/docs/comparison.md b/website/docs/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/docs/comparison.md
+++ b/website/docs/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 2e05446ae5a7..022b2b172f23 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -1851,7 +1851,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/docs/hudi_stack.md b/website/docs/hudi_stack.md
index d28231244187..472a1fe374e3 100644
--- a/website/docs/hudi_stack.md
+++ b/website/docs/hudi_stack.md
@@ -57,7 +57,7 @@ File Slices. File groups contain multiple versions of File
Slices and are split
the file-group is uniquely identified by the write that created its base file
or the first log file, which helps order the File Slices.
- **Metadata Table** : Implemented as another merge-on-read Hudi table, the
[metadata table](./metadata) efficiently handles quick updates with low write
amplification.
-It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html#sstables)
based file format for quick, indexed key lookups,
+It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage-engine.html#sstables)
based file format for quick, indexed key lookups,
storing vital information like file paths, column statistics and schema. This
approach streamlines operations by reducing the necessity for expensive cloud
file listings.
Hudi’s approach of recording updates into Log Files is more efficient and
involves low merge overhead than systems like Hive ACID, where merging all
delta records against
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index 8f3b403112ac..fe8827ebeec5 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -46,7 +46,7 @@ is tracked using internal tables. This approach provides the
following advantage
Following are the different types of metadata currently supported.
-- ***[files
listings](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
listings](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table, along with list of all partitions in
the table. Improves the files listing performance
by avoiding direct storage calls such as *exists, listStatus* and
*listFiles* on the data table.
diff --git a/website/docs/overview.mdx b/website/docs/overview.mdx
index bb8910f9c7ed..1e55d6916f3a 100644
--- a/website/docs/overview.mdx
+++ b/website/docs/overview.mdx
@@ -25,7 +25,7 @@ but it also allows you to create efficient incremental batch
pipelines. Apache H
Hudi’s advanced performance optimizations, make analytical queries/pipelines
faster with any of the popular query engines including, Apache Spark, Flink,
Presto, Trino, Hive, etc.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/docs/structure.md b/website/docs/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/docs/structure.md
+++ b/website/docs/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/docs/syncing_datahub.md b/website/docs/syncing_datahub.md
index 52b8d1e3e49e..39a4ea624864 100644
--- a/website/docs/syncing_datahub.md
+++ b/website/docs/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/docs/troubleshooting.md b/website/docs/troubleshooting.md
index 4696694d41d8..47de1002beae 100644
--- a/website/docs/troubleshooting.md
+++ b/website/docs/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/docs/tuning-guide.md b/website/docs/tuning-guide.md
index 4a1f72f1b05f..107fa6e67c70 100644
--- a/website/docs/tuning-guide.md
+++ b/website/docs/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-0.14.0/compaction.md b/website/versioned_docs/version-0.14.0/compaction.md
index f7a01c286f03..d9238bce6428 100644
--- a/website/versioned_docs/version-0.14.0/compaction.md
+++ b/website/versioned_docs/version-0.14.0/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](file_layouts). Each file group in
a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-0.14.0/comparison.md b/website/versioned_docs/version-0.14.0/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/versioned_docs/version-0.14.0/comparison.md
+++ b/website/versioned_docs/version-0.14.0/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/versioned_docs/version-0.14.0/configurations.md b/website/versioned_docs/version-0.14.0/configurations.md
index 3736351b3f34..447e99ecd68d 100644
--- a/website/versioned_docs/version-0.14.0/configurations.md
+++ b/website/versioned_docs/version-0.14.0/configurations.md
@@ -1578,7 +1578,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md
index 59984161e6a5..d840935e5c6e 100644
--- a/website/versioned_docs/version-0.14.0/faq.md
+++ b/website/versioned_docs/version-0.14.0/faq.md
@@ -474,7 +474,7 @@ The indexing component is a key part of the Hudi writing
and it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-0.14.0/metadata.md b/website/versioned_docs/version-0.14.0/metadata.md
index 48a7047409ca..50b99b907286 100644
--- a/website/versioned_docs/version-0.14.0/metadata.md
+++ b/website/versioned_docs/version-0.14.0/metadata.md
@@ -66,7 +66,7 @@ mechanism and is built on the following core principles:
Following are the different indices currently available under the metadata
table.
-- ***[files
index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
index](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table. Improves the files listing performance
by avoiding direct file system calls such
as *exists, listStatus* and *listFiles* on the data table.
diff --git a/website/versioned_docs/version-0.14.0/overview.mdx b/website/versioned_docs/version-0.14.0/overview.mdx
index 71a84591ce57..ed4e520d0ce8 100644
--- a/website/versioned_docs/version-0.14.0/overview.mdx
+++ b/website/versioned_docs/version-0.14.0/overview.mdx
@@ -20,7 +20,7 @@ and [concurrency](/docs/concurrency_control) all while
keeping your data in open
Not only is Apache Hudi great for streaming workloads, but it also allows you
to create efficient incremental batch pipelines.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/versioned_docs/version-0.14.0/s3_hoodie.md b/website/versioned_docs/version-0.14.0/s3_hoodie.md
index b990add7d4b7..5faad6e62be9 100644
--- a/website/versioned_docs/version-0.14.0/s3_hoodie.md
+++ b/website/versioned_docs/version-0.14.0/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner
utility](https://hudi.apache.orghoodie_cleaner) the number of Delete Markers
increases over time.
+With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner utility](hoodie_cleaner) the number of
Delete Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.14.0/schema_evolution.md b/website/versioned_docs/version-0.14.0/schema_evolution.md
index 49c91ff02902..4f93e85ddcb7 100755
--- a/website/versioned_docs/version-0.14.0/schema_evolution.md
+++ b/website/versioned_docs/version-0.14.0/schema_evolution.md
@@ -29,7 +29,7 @@ type reconciliations. The following table summarizes the
schema changes compatib
| Add a new complex type field with default (map and array)
| Yes | Yes |
|
| Add a new nullable column and change the ordering of fields
| No | No | Write succeeds but read fails if the write with
evolved schema updated only some of the base files but not all. Currently, Hudi
does not maintain a schema registry with history of changes across base files.
Nevertheless, if the upsert touched all base files then the read will succeed. |
| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col`
| Yes | Yes |
|
-| Promote datatype from `int` to `long` for a field at root level
| Yes | Yes | For other types, Hudi supports promotion as
specified in [Avro schema
resolution](http://avro.apache.org/docs/current/spec#Schema+Resolution).
|
+| Promote datatype from `int` to `long` for a field at root level
| Yes | Yes | For other types, Hudi supports promotion as
specified in [Avro schema
resolution](https://avro.apache.org/docs/++version++/specification/#schema-resolution).
|
| Promote datatype from `int` to `long` for a nested field
| Yes | Yes |
| Promote datatype from `int` to `long` for a complex type (value of map or
array) | Yes | Yes |
|
| Add a new non-nullable column at root level at the end
| No | No | In case of MOR table with Spark data source, write
succeeds but read fails. As a **workaround**, you can make the field nullable.
|
diff --git a/website/versioned_docs/version-0.14.0/sql_queries.md b/website/versioned_docs/version-0.14.0/sql_queries.md
index b909287ae05f..7aef29612f4b 100644
--- a/website/versioned_docs/version-0.14.0/sql_queries.md
+++ b/website/versioned_docs/version-0.14.0/sql_queries.md
@@ -329,7 +329,7 @@ for more details.
Copy on Write Tables in Hudi version 0.10.0 can be queried via Doris external
tables starting from Doris version 1.1.
Please refer
-to [Doris Hudi
Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/)
+to [Doris Hudi
Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog)
for more details on the setup.
:::note
diff --git a/website/versioned_docs/version-0.14.0/structure.md b/website/versioned_docs/version-0.14.0/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-0.14.0/structure.md
+++ b/website/versioned_docs/version-0.14.0/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-0.14.0/syncing_datahub.md b/website/versioned_docs/version-0.14.0/syncing_datahub.md
index 40fcd1d1891e..952249d3ff68 100644
--- a/website/versioned_docs/version-0.14.0/syncing_datahub.md
+++ b/website/versioned_docs/version-0.14.0/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-0.14.0/troubleshooting.md b/website/versioned_docs/version-0.14.0/troubleshooting.md
index aaa3f4feb635..13d3f3ac98af 100644
--- a/website/versioned_docs/version-0.14.0/troubleshooting.md
+++ b/website/versioned_docs/version-0.14.0/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-0.14.0/tuning-guide.md
b/website/versioned_docs/version-0.14.0/tuning-guide.md
index 4eaddce2dbd3..96a64ed78e95 100644
--- a/website/versioned_docs/version-0.14.0/tuning-guide.md
+++ b/website/versioned_docs/version-0.14.0/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-0.14.0/use_cases.md
b/website/versioned_docs/version-0.14.0/use_cases.md
index 893aa653f5e7..e9ccd84ef310 100644
--- a/website/versioned_docs/version-0.14.0/use_cases.md
+++ b/website/versioned_docs/version-0.14.0/use_cases.md
@@ -22,7 +22,7 @@ more value is created.
For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed
costly & inefficient bulk loads. It's very common to use a change capture
solution like
[Debezium](http://debezium.io/) or [Kafka
Connect](https://docs.confluent.io/platform/current/connect/index) or
[Sqoop Incremental
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports)
and apply them to an
-equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) /
[Voldemort](http://www.project-voldemort.com/voldemort/) /
[HBase](https://hbase.apache.org/),
+equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) / [HBase](https://hbase.apache.org/),
even moderately big installations store billions of rows. It goes without
saying that __full bulk loads are simply infeasible__ and more efficient
approaches
are needed if ingestion is to keep up with the typically high update volumes.
diff --git a/website/versioned_docs/version-0.14.1/compaction.md
b/website/versioned_docs/version-0.14.1/compaction.md
index 5df14e5af971..3ecd95b43853 100644
--- a/website/versioned_docs/version-0.14.1/compaction.md
+++ b/website/versioned_docs/version-0.14.1/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](file_layouts). Each file group in
a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-0.14.1/comparison.md
b/website/versioned_docs/version-0.14.1/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/versioned_docs/version-0.14.1/comparison.md
+++ b/website/versioned_docs/version-0.14.1/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/versioned_docs/version-0.14.1/configurations.md
b/website/versioned_docs/version-0.14.1/configurations.md
index 45d2fd560418..a39c4774aa7b 100644
--- a/website/versioned_docs/version-0.14.1/configurations.md
+++ b/website/versioned_docs/version-0.14.1/configurations.md
@@ -1577,7 +1577,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-0.14.1/faq_storage.md
b/website/versioned_docs/version-0.14.1/faq_storage.md
index 43ca76817a8c..4f7bfd498aeb 100644
--- a/website/versioned_docs/version-0.14.1/faq_storage.md
+++ b/website/versioned_docs/version-0.14.1/faq_storage.md
@@ -47,7 +47,7 @@ The indexing component is a key part of the Hudi writing and
it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-0.14.1/metadata.md
b/website/versioned_docs/version-0.14.1/metadata.md
index 52e4c788275f..df520e8a5564 100644
--- a/website/versioned_docs/version-0.14.1/metadata.md
+++ b/website/versioned_docs/version-0.14.1/metadata.md
@@ -66,7 +66,7 @@ mechanism and is built on the following core principles:
Following are the different indices currently available under the metadata
table.
-- ***[files
index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
index](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table. Improves the files listing performance
by avoiding direct file system calls such
as *exists, listStatus* and *listFiles* on the data table.
diff --git a/website/versioned_docs/version-0.14.1/overview.mdx
b/website/versioned_docs/version-0.14.1/overview.mdx
index e6a288328b63..8123b427d464 100644
--- a/website/versioned_docs/version-0.14.1/overview.mdx
+++ b/website/versioned_docs/version-0.14.1/overview.mdx
@@ -20,7 +20,7 @@ and [concurrency](/docs/next/concurrency_control) all while
keeping your data in
Not only is Apache Hudi great for streaming workloads, but it also allows you
to create efficient incremental batch pipelines.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/versioned_docs/version-0.14.1/s3_hoodie.md
b/website/versioned_docs/version-0.14.1/s3_hoodie.md
index b990add7d4b7..5faad6e62be9 100644
--- a/website/versioned_docs/version-0.14.1/s3_hoodie.md
+++ b/website/versioned_docs/version-0.14.1/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner
utility](https://hudi.apache.orghoodie_cleaner) the number of Delete Markers
increases over time.
+With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner utility](hoodie_cleaner) the number of
Delete Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.14.1/sql_queries.md
b/website/versioned_docs/version-0.14.1/sql_queries.md
index 44fbd055289d..a43c2bfcf992 100644
--- a/website/versioned_docs/version-0.14.1/sql_queries.md
+++ b/website/versioned_docs/version-0.14.1/sql_queries.md
@@ -337,7 +337,7 @@ for more details.
Copy on Write Tables in Hudi version 0.10.0 can be queried via Doris external
tables starting from Doris version 1.1.
Please refer
-to [Doris Hudi
Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/)
+to [Doris Hudi
Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog)
for more details on the setup.
:::note
diff --git a/website/versioned_docs/version-0.14.1/structure.md
b/website/versioned_docs/version-0.14.1/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-0.14.1/structure.md
+++ b/website/versioned_docs/version-0.14.1/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-0.14.1/syncing_datahub.md
b/website/versioned_docs/version-0.14.1/syncing_datahub.md
index 40fcd1d1891e..952249d3ff68 100644
--- a/website/versioned_docs/version-0.14.1/syncing_datahub.md
+++ b/website/versioned_docs/version-0.14.1/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-0.14.1/table_types.md
b/website/versioned_docs/version-0.14.1/table_types.md
index 28814d239e81..2174aae8f7a8 100644
--- a/website/versioned_docs/version-0.14.1/table_types.md
+++ b/website/versioned_docs/version-0.14.1/table_types.md
@@ -149,4 +149,4 @@ Refer
[here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
* [Comparing Apache Hudi's MOR and COW Tables, Use Cases from
Uber](https://youtu.be/BiTXyzFNHlA)
* [Different table types in Apache Hudi, MOR and COW, Deep
Dive](https://youtu.be/vyEvlt57L-s)
-* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQx)
\ No newline at end of file
+* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQ)
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.14.1/troubleshooting.md
b/website/versioned_docs/version-0.14.1/troubleshooting.md
index aaa3f4feb635..13d3f3ac98af 100644
--- a/website/versioned_docs/version-0.14.1/troubleshooting.md
+++ b/website/versioned_docs/version-0.14.1/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-0.14.1/tuning-guide.md
b/website/versioned_docs/version-0.14.1/tuning-guide.md
index 4eaddce2dbd3..96a64ed78e95 100644
--- a/website/versioned_docs/version-0.14.1/tuning-guide.md
+++ b/website/versioned_docs/version-0.14.1/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-0.14.1/use_cases.md
b/website/versioned_docs/version-0.14.1/use_cases.md
index 4d06f1e571a6..fb6061b8d1b5 100644
--- a/website/versioned_docs/version-0.14.1/use_cases.md
+++ b/website/versioned_docs/version-0.14.1/use_cases.md
@@ -22,7 +22,7 @@ more value is created.
For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed
costly & inefficient bulk loads. It's very common to use a change capture
solution like
[Debezium](http://debezium.io/) or [Kafka
Connect](https://docs.confluent.io/platform/current/connect/index) or
[Sqoop Incremental
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports)
and apply them to an
-equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) /
[Voldemort](http://www.project-voldemort.com/voldemort/) /
[HBase](https://hbase.apache.org/),
+equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) / [HBase](https://hbase.apache.org/),
even moderately big installations store billions of rows. It goes without
saying that __full bulk loads are simply infeasible__ and more efficient
approaches
are needed if ingestion is to keep up with the typically high update volumes.
diff --git a/website/versioned_docs/version-0.15.0/compaction.md
b/website/versioned_docs/version-0.15.0/compaction.md
index 54fdfdb54987..1ec5506c3535 100644
--- a/website/versioned_docs/version-0.15.0/compaction.md
+++ b/website/versioned_docs/version-0.15.0/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](file_layouts). Each file group in
a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-0.15.0/comparison.md
b/website/versioned_docs/version-0.15.0/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/versioned_docs/version-0.15.0/comparison.md
+++ b/website/versioned_docs/version-0.15.0/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/versioned_docs/version-0.15.0/configurations.md
b/website/versioned_docs/version-0.15.0/configurations.md
index 19cf11a27c23..d90509a4e147 100644
--- a/website/versioned_docs/version-0.15.0/configurations.md
+++ b/website/versioned_docs/version-0.15.0/configurations.md
@@ -1730,7 +1730,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-0.15.0/faq_storage.md
b/website/versioned_docs/version-0.15.0/faq_storage.md
index 359c7764da61..c9456670ecdc 100644
--- a/website/versioned_docs/version-0.15.0/faq_storage.md
+++ b/website/versioned_docs/version-0.15.0/faq_storage.md
@@ -47,7 +47,7 @@ The indexing component is a key part of the Hudi writing and
it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-0.15.0/metadata.md
b/website/versioned_docs/version-0.15.0/metadata.md
index b2b57e62f84e..93580bdc04fe 100644
--- a/website/versioned_docs/version-0.15.0/metadata.md
+++ b/website/versioned_docs/version-0.15.0/metadata.md
@@ -66,7 +66,7 @@ mechanism and is built on the following core principles:
Following are the different indices currently available under the metadata
table.
-- ***[files
index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
index](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table. Improves the files listing performance
by avoiding direct file system calls such
as *exists, listStatus* and *listFiles* on the data table.
diff --git a/website/versioned_docs/version-0.15.0/overview.mdx
b/website/versioned_docs/version-0.15.0/overview.mdx
index 27237f6438d3..6083a9d248fa 100644
--- a/website/versioned_docs/version-0.15.0/overview.mdx
+++ b/website/versioned_docs/version-0.15.0/overview.mdx
@@ -20,7 +20,7 @@ and [concurrency](/docs/next/concurrency_control) all while
keeping your data in
Not only is Apache Hudi great for streaming workloads, but it also allows you
to create efficient incremental batch pipelines.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git
a/website/versioned_docs/version-0.15.0/reading_tables_batch_reads.md
b/website/versioned_docs/version-0.15.0/reading_tables_batch_reads.md
index d247fd4c3d08..f3ddcd236694 100644
--- a/website/versioned_docs/version-0.15.0/reading_tables_batch_reads.md
+++ b/website/versioned_docs/version-0.15.0/reading_tables_batch_reads.md
@@ -32,4 +32,4 @@ df = df.where(df["foo"] > 5)
df.show()
```
-Check out the Daft docs for [Hudi
integration](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/hudi.html).
+Check out the Daft docs for [Hudi
integration](https://docs.daft.ai/en/stable/connectors/hudi/).
diff --git a/website/versioned_docs/version-0.15.0/s3_hoodie.md
b/website/versioned_docs/version-0.15.0/s3_hoodie.md
index b990add7d4b7..5faad6e62be9 100644
--- a/website/versioned_docs/version-0.15.0/s3_hoodie.md
+++ b/website/versioned_docs/version-0.15.0/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner
utility](https://hudi.apache.orghoodie_cleaner) the number of Delete Markers
increases over time.
+With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner utility](hoodie_cleaner) the number of
Delete Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.15.0/sql_queries.md
b/website/versioned_docs/version-0.15.0/sql_queries.md
index 998d90d6b553..7d3f96fbb1a4 100644
--- a/website/versioned_docs/version-0.15.0/sql_queries.md
+++ b/website/versioned_docs/version-0.15.0/sql_queries.md
@@ -336,7 +336,7 @@ for more details.
## Doris
The Doris integration currently support Copy on Write and Merge On Read tables
in Hudi since version 0.10.0. You can query Hudi tables via Doris from Doris
version 2.0 Doris offers a multi-catalog, which is designed to make it easier
to connect to external data catalogs to enhance Doris's data lake analysis and
federated data query capabilities. Please refer
-to [Doris Hudi
Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/) for
more details on the setup.
+to [Doris Hudi
Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog) for
more details on the setup.
:::note
The current default supported version of Hudi is 0.10.0 ~ 0.13.1, and has not
been tested in other versions. More versions will be supported in the future.
diff --git a/website/versioned_docs/version-0.15.0/structure.md
b/website/versioned_docs/version-0.15.0/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-0.15.0/structure.md
+++ b/website/versioned_docs/version-0.15.0/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-0.15.0/syncing_datahub.md
b/website/versioned_docs/version-0.15.0/syncing_datahub.md
index 40fcd1d1891e..952249d3ff68 100644
--- a/website/versioned_docs/version-0.15.0/syncing_datahub.md
+++ b/website/versioned_docs/version-0.15.0/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-0.15.0/table_types.md
b/website/versioned_docs/version-0.15.0/table_types.md
index e280909a9f3b..eb2495894216 100644
--- a/website/versioned_docs/version-0.15.0/table_types.md
+++ b/website/versioned_docs/version-0.15.0/table_types.md
@@ -149,4 +149,4 @@ Refer
[here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
* [Comparing Apache Hudi's MOR and COW Tables, Use Cases from
Uber](https://youtu.be/BiTXyzFNHlA)
* [Different table types in Apache Hudi, MOR and COW, Deep
Dive](https://youtu.be/vyEvlt57L-s)
-* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQx)
\ No newline at end of file
+* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQ)
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.15.0/troubleshooting.md
b/website/versioned_docs/version-0.15.0/troubleshooting.md
index f16fa458ee7f..6756033995b4 100644
--- a/website/versioned_docs/version-0.15.0/troubleshooting.md
+++ b/website/versioned_docs/version-0.15.0/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-0.15.0/tuning-guide.md
b/website/versioned_docs/version-0.15.0/tuning-guide.md
index 4a1f72f1b05f..107fa6e67c70 100644
--- a/website/versioned_docs/version-0.15.0/tuning-guide.md
+++ b/website/versioned_docs/version-0.15.0/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-0.15.0/use_cases.md
b/website/versioned_docs/version-0.15.0/use_cases.md
index 4d06f1e571a6..fb6061b8d1b5 100644
--- a/website/versioned_docs/version-0.15.0/use_cases.md
+++ b/website/versioned_docs/version-0.15.0/use_cases.md
@@ -22,7 +22,7 @@ more value is created.
For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed
costly & inefficient bulk loads. It's very common to use a change capture
solution like
[Debezium](http://debezium.io/) or [Kafka
Connect](https://docs.confluent.io/platform/current/connect/index) or
[Sqoop Incremental
Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports)
and apply them to an
-equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) /
[Voldemort](http://www.project-voldemort.com/voldemort/) /
[HBase](https://hbase.apache.org/),
+equivalent Hudi table on DFS. For NoSQL datastores like
[Cassandra](http://cassandra.apache.org/) / [HBase](https://hbase.apache.org/),
even moderately big installations store billions of rows. It goes without
saying that __full bulk loads are simply infeasible__ and more efficient
approaches
are needed if ingestion is to keep up with the typically high update volumes.
diff --git a/website/versioned_docs/version-1.0.0/compaction.md
b/website/versioned_docs/version-1.0.0/compaction.md
index 7859030052aa..941c1d227fce 100644
--- a/website/versioned_docs/version-1.0.0/compaction.md
+++ b/website/versioned_docs/version-1.0.0/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](storage_layouts). Each file group
in a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-1.0.0/comparison.md
b/website/versioned_docs/version-1.0.0/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/versioned_docs/version-1.0.0/comparison.md
+++ b/website/versioned_docs/version-1.0.0/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/versioned_docs/version-1.0.0/configurations.md
b/website/versioned_docs/version-1.0.0/configurations.md
index 0758147e0df0..3a17558accc9 100644
--- a/website/versioned_docs/version-1.0.0/configurations.md
+++ b/website/versioned_docs/version-1.0.0/configurations.md
@@ -1825,7 +1825,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-1.0.0/faq_storage.md
b/website/versioned_docs/version-1.0.0/faq_storage.md
index fcce76aa46e1..8917fdcb9abb 100644
--- a/website/versioned_docs/version-1.0.0/faq_storage.md
+++ b/website/versioned_docs/version-1.0.0/faq_storage.md
@@ -47,7 +47,7 @@ The indexing component is a key part of the Hudi writing and
it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-1.0.0/hudi_stack.md
b/website/versioned_docs/version-1.0.0/hudi_stack.md
index d28231244187..472a1fe374e3 100644
--- a/website/versioned_docs/version-1.0.0/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.0/hudi_stack.md
@@ -57,7 +57,7 @@ File Slices. File groups contain multiple versions of File
Slices and are split
the file-group is uniquely identified by the write that created its base file
or the first log file, which helps order the File Slices.
- **Metadata Table** : Implemented as another merge-on-read Hudi table, the
[metadata table](./metadata) efficiently handles quick updates with low write
amplification.
-It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html#sstables)
based file format for quick, indexed key lookups,
+It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage-engine.html#sstables)
based file format for quick, indexed key lookups,
storing vital information like file paths, column statistics and schema. This
approach streamlines operations by reducing the necessity for expensive cloud
file listings.
Hudi’s approach of recording updates into Log Files is more efficient and
involves low merge overhead than systems like Hive ACID, where merging all
delta records against
diff --git a/website/versioned_docs/version-1.0.0/metadata.md
b/website/versioned_docs/version-1.0.0/metadata.md
index 47661f314114..6ad199e7dec6 100644
--- a/website/versioned_docs/version-1.0.0/metadata.md
+++ b/website/versioned_docs/version-1.0.0/metadata.md
@@ -46,7 +46,7 @@ is tracked using internal tables. This approach provides the
following advantage
Following are the different types of metadata currently supported.
-- ***[files
listings](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
listings](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table, along with list of all partitions in
the table. Improves the files listing performance
by avoiding direct storage calls such as *exists, listStatus* and
*listFiles* on the data table.
diff --git a/website/versioned_docs/version-1.0.0/overview.mdx
b/website/versioned_docs/version-1.0.0/overview.mdx
index bb8910f9c7ed..1e55d6916f3a 100644
--- a/website/versioned_docs/version-1.0.0/overview.mdx
+++ b/website/versioned_docs/version-1.0.0/overview.mdx
@@ -25,7 +25,7 @@ but it also allows you to create efficient incremental batch
pipelines. Apache H
Hudi’s advanced performance optimizations, make analytical queries/pipelines
faster with any of the popular query engines including, Apache Spark, Flink,
Presto, Trino, Hive, etc.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/versioned_docs/version-1.0.0/reading_tables_batch_reads.md
b/website/versioned_docs/version-1.0.0/reading_tables_batch_reads.md
index d247fd4c3d08..f3ddcd236694 100644
--- a/website/versioned_docs/version-1.0.0/reading_tables_batch_reads.md
+++ b/website/versioned_docs/version-1.0.0/reading_tables_batch_reads.md
@@ -32,4 +32,4 @@ df = df.where(df["foo"] > 5)
df.show()
```
-Check out the Daft docs for [Hudi
integration](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/hudi.html).
+Check out the Daft docs for [Hudi
integration](https://docs.daft.ai/en/stable/connectors/hudi/).
diff --git a/website/versioned_docs/version-1.0.0/s3_hoodie.md
b/website/versioned_docs/version-1.0.0/s3_hoodie.md
index b990add7d4b7..3161ea4bd284 100644
--- a/website/versioned_docs/version-1.0.0/s3_hoodie.md
+++ b/website/versioned_docs/version-1.0.0/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner
utility](https://hudi.apache.orghoodie_cleaner) the number of Delete Markers
increases over time.
+With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner utility](cleaning) the number of Delete
Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.0/sql_queries.md
b/website/versioned_docs/version-1.0.0/sql_queries.md
index 3042af0a0d05..b51b9155ddeb 100644
--- a/website/versioned_docs/version-1.0.0/sql_queries.md
+++ b/website/versioned_docs/version-1.0.0/sql_queries.md
@@ -647,7 +647,7 @@ for more details.
## Doris
The Doris integration currently support Copy on Write and Merge On Read tables
in Hudi since version 0.10.0. You can query Hudi tables via Doris from Doris
version 2.0 Doris offers a multi-catalog, which is designed to make it easier
to connect to external data catalogs to enhance Doris's data lake analysis and
federated data query capabilities. Please refer
-to [Doris Hudi
Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/) for
more details on the setup.
+to [Doris Hudi
Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog) for
more details on the setup.
:::note
The current default supported version of Hudi is 0.10.0 ~ 0.13.1, and has not
been tested in other versions. More versions will be supported in the future.
diff --git a/website/versioned_docs/version-1.0.0/structure.md
b/website/versioned_docs/version-1.0.0/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-1.0.0/structure.md
+++ b/website/versioned_docs/version-1.0.0/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-1.0.0/syncing_datahub.md
b/website/versioned_docs/version-1.0.0/syncing_datahub.md
index 89cf9bf87996..8cad3da38442 100644
--- a/website/versioned_docs/version-1.0.0/syncing_datahub.md
+++ b/website/versioned_docs/version-1.0.0/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-1.0.0/table_types.md
b/website/versioned_docs/version-1.0.0/table_types.md
index 3b7ec911bfc0..c2ae8baab9eb 100644
--- a/website/versioned_docs/version-1.0.0/table_types.md
+++ b/website/versioned_docs/version-1.0.0/table_types.md
@@ -204,4 +204,4 @@ Refer
[here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
* [Comparing Apache Hudi's MOR and COW Tables, Use Cases from
Uber](https://youtu.be/BiTXyzFNHlA)
* [Different table types in Apache Hudi, MOR and COW, Deep
Dive](https://youtu.be/vyEvlt57L-s)
-* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQx)
\ No newline at end of file
+* [How to Query Hudi Tables in Incremental Fashion and Get only New data on
AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQ)
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.0/troubleshooting.md
b/website/versioned_docs/version-1.0.0/troubleshooting.md
index 4696694d41d8..47de1002beae 100644
--- a/website/versioned_docs/version-1.0.0/troubleshooting.md
+++ b/website/versioned_docs/version-1.0.0/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-1.0.0/tuning-guide.md
b/website/versioned_docs/version-1.0.0/tuning-guide.md
index 4a1f72f1b05f..107fa6e67c70 100644
--- a/website/versioned_docs/version-1.0.0/tuning-guide.md
+++ b/website/versioned_docs/version-1.0.0/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-1.0.1/compaction.md
b/website/versioned_docs/version-1.0.1/compaction.md
index 6025d89916be..500687c658e5 100644
--- a/website/versioned_docs/version-1.0.1/compaction.md
+++ b/website/versioned_docs/version-1.0.1/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](storage_layouts). Each file group
in a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-1.0.1/comparison.md
b/website/versioned_docs/version-1.0.1/comparison.md
index 681b359a4de8..7ba799e1453e 100644
--- a/website/versioned_docs/version-1.0.1/comparison.md
+++ b/website/versioned_docs/version-1.0.1/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.1/configurations.md
b/website/versioned_docs/version-1.0.1/configurations.md
index 0758147e0df0..3a17558accc9 100644
--- a/website/versioned_docs/version-1.0.1/configurations.md
+++ b/website/versioned_docs/version-1.0.1/configurations.md
@@ -1825,7 +1825,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-1.0.1/faq_storage.md
b/website/versioned_docs/version-1.0.1/faq_storage.md
index fcce76aa46e1..8917fdcb9abb 100644
--- a/website/versioned_docs/version-1.0.1/faq_storage.md
+++ b/website/versioned_docs/version-1.0.1/faq_storage.md
@@ -47,7 +47,7 @@ The indexing component is a key part of the Hudi writing and
it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-1.0.1/hudi_stack.md
b/website/versioned_docs/version-1.0.1/hudi_stack.md
index d3e0fb335353..7989c59fff79 100644
--- a/website/versioned_docs/version-1.0.1/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.1/hudi_stack.md
@@ -57,7 +57,7 @@ File Slices. File groups contain multiple versions of File
Slices and are split
the file-group is uniquely identified by the write that created its base file
or the first log file, which helps order the File Slices.
- **Metadata Table** : Implemented as another merge-on-read Hudi table, the
[metadata table](./metadata) efficiently handles quick updates with low write
amplification.
-It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html#sstables)
based file format for quick, indexed key lookups,
+It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage-engine.html#sstables)
based file format for quick, indexed key lookups,
storing vital information like file paths, column statistics and schema. This
approach streamlines operations by reducing the necessity for expensive cloud
file listings.
Hudi’s approach of recording updates into Log Files is more efficient and
involves low merge overhead than systems like Hive ACID, where merging all
delta records against
diff --git a/website/versioned_docs/version-1.0.1/metadata.md
b/website/versioned_docs/version-1.0.1/metadata.md
index 8f3b403112ac..fe8827ebeec5 100644
--- a/website/versioned_docs/version-1.0.1/metadata.md
+++ b/website/versioned_docs/version-1.0.1/metadata.md
@@ -46,7 +46,7 @@ is tracked using internal tables. This approach provides the
following advantage
Following are the different types of metadata currently supported.
-- ***[files
listings](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
listings](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table, along with list of all partitions in
the table. Improves the files listing performance
by avoiding direct storage calls such as *exists, listStatus* and
*listFiles* on the data table.
diff --git a/website/versioned_docs/version-1.0.1/overview.mdx
b/website/versioned_docs/version-1.0.1/overview.mdx
index bb8910f9c7ed..1e55d6916f3a 100644
--- a/website/versioned_docs/version-1.0.1/overview.mdx
+++ b/website/versioned_docs/version-1.0.1/overview.mdx
@@ -25,7 +25,7 @@ but it also allows you to create efficient incremental batch
pipelines. Apache H
Hudi’s advanced performance optimizations, make analytical queries/pipelines
faster with any of the popular query engines including, Apache Spark, Flink,
Presto, Trino, Hive, etc.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/versioned_docs/version-1.0.1/reading_tables_batch_reads.md
b/website/versioned_docs/version-1.0.1/reading_tables_batch_reads.md
index d247fd4c3d08..f3ddcd236694 100644
--- a/website/versioned_docs/version-1.0.1/reading_tables_batch_reads.md
+++ b/website/versioned_docs/version-1.0.1/reading_tables_batch_reads.md
@@ -32,4 +32,4 @@ df = df.where(df["foo"] > 5)
df.show()
```
-Check out the Daft docs for [Hudi
integration](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/hudi.html).
+Check out the Daft docs for [Hudi
integration](https://docs.daft.ai/en/stable/connectors/hudi/).
diff --git a/website/versioned_docs/version-1.0.1/s3_hoodie.md
b/website/versioned_docs/version-1.0.1/s3_hoodie.md
index 37f79ae75342..3161ea4bd284 100644
--- a/website/versioned_docs/version-1.0.1/s3_hoodie.md
+++ b/website/versioned_docs/version-1.0.1/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner
utility](https://hudi.apache.org/docs/hoodie_cleaner) the number of Delete
Markers increases over time.
+With versioned buckets any object deleted creates a [Delete
Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html),
as Hudi cleans up files using [Cleaner utility](cleaning) the number of Delete
Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.1/sql_queries.md
b/website/versioned_docs/version-1.0.1/sql_queries.md
index 3042af0a0d05..b51b9155ddeb 100644
--- a/website/versioned_docs/version-1.0.1/sql_queries.md
+++ b/website/versioned_docs/version-1.0.1/sql_queries.md
@@ -647,7 +647,7 @@ for more details.
## Doris
The Doris integration currently support Copy on Write and Merge On Read tables
in Hudi since version 0.10.0. You can query Hudi tables via Doris from Doris
version 2.0 Doris offers a multi-catalog, which is designed to make it easier
to connect to external data catalogs to enhance Doris's data lake analysis and
federated data query capabilities. Please refer
-to [Doris Hudi
Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/) for
more details on the setup.
+to [Doris Hudi
Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog) for
more details on the setup.
:::note
The current default supported version of Hudi is 0.10.0 ~ 0.13.1, and has not
been tested in other versions. More versions will be supported in the future.
diff --git a/website/versioned_docs/version-1.0.1/structure.md
b/website/versioned_docs/version-1.0.1/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-1.0.1/structure.md
+++ b/website/versioned_docs/version-1.0.1/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a
combination of columnar & row based storage (e.g Parquet +
[Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-1.0.1/syncing_datahub.md
b/website/versioned_docs/version-1.0.1/syncing_datahub.md
index 2a8003a2eec6..28803704c161 100644
--- a/website/versioned_docs/version-1.0.1/syncing_datahub.md
+++ b/website/versioned_docs/version-1.0.1/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that
supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports
features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-1.0.1/troubleshooting.md
b/website/versioned_docs/version-1.0.1/troubleshooting.md
index 4696694d41d8..47de1002beae 100644
--- a/website/versioned_docs/version-1.0.1/troubleshooting.md
+++ b/website/versioned_docs/version-1.0.1/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving
forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata
([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-1.0.1/tuning-guide.md
b/website/versioned_docs/version-1.0.1/tuning-guide.md
index 4a1f72f1b05f..107fa6e67c70 100644
--- a/website/versioned_docs/version-1.0.1/tuning-guide.md
+++ b/website/versioned_docs/version-1.0.1/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data
to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi
jobs. For any Hudi job using Spark, parallelism equals to the number of spark
partitions that should be generated for a particular stage in the DAG. To
understand more about spark partitions, read this
[article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297).
In spark, each spark partition is mapped to a spark task that can be executed
on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)
diff --git a/website/versioned_docs/version-1.0.2/compaction.md
b/website/versioned_docs/version-1.0.2/compaction.md
index 6af8e6361c19..35d781fa44e4 100644
--- a/website/versioned_docs/version-1.0.2/compaction.md
+++ b/website/versioned_docs/version-1.0.2/compaction.md
@@ -13,7 +13,7 @@ not applicable to Copy On Write(COW) tables and only applies
to MOR tables.
### Why MOR tables need compaction?
To understand the significance of compaction in MOR tables, it is helpful to
understand the MOR table layout first. In Hudi,
-data is organized in terms of [file
groups](https://hudi.apache.org/docs/file_layouts/). Each file group in a MOR
table
+data is organized in terms of [file groups](/docs/storage_layouts/). Each file
group in a MOR table
consists of a base file and one or more log files. Typically, during writes,
inserts are stored in the base file, and updates
are appended to log files.
diff --git a/website/versioned_docs/version-1.0.2/comparison.md
b/website/versioned_docs/version-1.0.2/comparison.md
index 681b359a4de8..0bcce2ace532 100644
--- a/website/versioned_docs/version-1.0.2/comparison.md
+++ b/website/versioned_docs/version-1.0.2/comparison.md
@@ -52,5 +52,5 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
-to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
+to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/website/versioned_docs/version-1.0.2/configurations.md
b/website/versioned_docs/version-1.0.2/configurations.md
index 2e05446ae5a7..022b2b172f23 100644
--- a/website/versioned_docs/version-1.0.2/configurations.md
+++ b/website/versioned_docs/version-1.0.2/configurations.md
@@ -1851,7 +1851,7 @@ These set of configs are used for Hudi Streamer utility
which provides the way t
| [hoodie.streamer.sample.writes.size](#hoodiestreamersamplewritessize)
| 5000 | Number of records to sample
from the first write. To improve the estimation's accuracy, for smaller or more
compressable record size, set the sample size bigger. For bigger or less
compressable record size, set smaller.<br />`Config Param:
SAMPLE_WRITES_SIZE`<br />`Since Version: 0.14.0`
[...]
|
[hoodie.streamer.source.kafka.append.offsets](#hoodiestreamersourcekafkaappendoffsets)
| false | When enabled, appends kafka offset
info like source offset(_hoodie_kafka_source_offset), partition
(_hoodie_kafka_source_partition) and timestamp (_hoodie_kafka_source_timestamp)
to the records. By default its disabled and no kafka offsets are added<br
/>`Config Param: KAFKA_APPEND_OFFSETS`
[...]
|
[hoodie.streamer.source.sanitize.invalid.char.mask](#hoodiestreamersourcesanitizeinvalidcharmask)
| __ | Defines the character sequence that replaces
invalid characters in schema field names if
hoodie.streamer.source.sanitize.invalid.schema.field.names is enabled.<br
/>`Config Param: SCHEMA_FIELD_NAME_INVALID_CHAR_MASK`
[...]
-|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/current/spec.html#names).<br />`Config Param:
SANITIZE_SCHEMA_FIELD_NAMES` [...]
+|
[hoodie.streamer.source.sanitize.invalid.schema.field.names](#hoodiestreamersourcesanitizeinvalidschemafieldnames)
| false | Sanitizes names of invalid schema fields both in the data read
from source and also in the schema Replaces invalid characters with
hoodie.streamer.source.sanitize.invalid.char.mask. Invalid characters are by
goes by avro naming convention
(https://avro.apache.org/docs/++version++/specification/#names).<br />`Config
Param: SANITIZE_SCHEMA_FIELD_NAMES` [...]
---
diff --git a/website/versioned_docs/version-1.0.2/faq_storage.md
b/website/versioned_docs/version-1.0.2/faq_storage.md
index fcce76aa46e1..8917fdcb9abb 100644
--- a/website/versioned_docs/version-1.0.2/faq_storage.md
+++ b/website/versioned_docs/version-1.0.2/faq_storage.md
@@ -47,7 +47,7 @@ The indexing component is a key part of the Hudi writing and
it maps a given rec
Hudi supports a few options for indexing as below
* _HoodieBloomIndex_ : Uses a bloom filter and ranges information placed in
the footer of parquet/base files (and soon log files as well)
-* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some
cases, it might be necessary instead to do the de-duping/enforce uniqueness
across all partitions and the global bloom index does exactly that. If this i
[...]
+* _HoodieGlobalBloomIndex_ : The non global indexing only enforces
uniqueness of a key inside a single partition i.e the user is expected to know
the partition under which a given record key is stored. This helps the indexing
scale very well for even [very large
datasets](https://www.uber.com/en-IN/blog/uber-big-data-platform/). However, in
some cases, it might be necessary instead to do the de-duping/enforce
uniqueness across all partitions and the global bloom index does exactly that
[...]
* _HBaseIndex_ : Apache HBase is a key value store, typically found in close
proximity to HDFS. You can also store the index inside HBase, which could be
handy if you are already operating HBase.
* _HoodieSimpleIndex (default)_ : A simple index which reads interested
fields (record key and partition path) from base files and joins with incoming
records to find the tagged location.
* _HoodieGlobalSimpleIndex_ : Global version of Simple Index, where in
uniqueness is on record key across entire table.
diff --git a/website/versioned_docs/version-1.0.2/hudi_stack.md
b/website/versioned_docs/version-1.0.2/hudi_stack.md
index d28231244187..64d28643d39d 100644
--- a/website/versioned_docs/version-1.0.2/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.2/hudi_stack.md
@@ -49,19 +49,19 @@ bring any compute engine for specific workloads.
Drawing an analogy to file formats, a table format simply concerns with how
files are distributed with the table, partitioning schemes, schema and metadata
tracking changes. Hudi organizes files within a table or partition into
File Groups. Updates are captured in log files tied to these File Groups,
ensuring efficient merges. There are three major components related to Hudi’s
table format.
-- **Timeline** : Hudi's [timeline](./timeline), stored in the
`/.hoodie/timeline` folder, is a crucial event log recording all table actions
in an ordered manner,
+- **Timeline** : Hudi's [timeline](/docs/timeline), stored in the
`/.hoodie/timeline` folder, is a crucial event log recording all table actions
in an ordered manner,
with events kept for a specified period. Hudi uniquely designs each File
Group as a self-contained log, enabling record state reconstruction through
delta logs, even after archival of historical actions. This approach
effectively limits metadata size based on table activity frequency, essential
for managing tables with frequent updates.
- **File Group and File Slice** : Within each partition the data is physically
stored as base and Log Files and organized into logical concepts as [File
groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and
File Slices. File groups contain multiple versions of File Slices and are
split into multiple File Slices. A File Slice comprises the Base and Log File.
Each File Slice within
the file-group is uniquely identified by the write that created its base file
or the first log file, which helps order the File Slices.
-- **Metadata Table** : Implemented as another merge-on-read Hudi table, the
[metadata table](./metadata) efficiently handles quick updates with low write
amplification.
-It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html#sstables)
based file format for quick, indexed key lookups,
+- **Metadata Table** : Implemented as another merge-on-read Hudi table, the
[metadata table](/docs/metadata) efficiently handles quick updates with low
write amplification.
+It leverages a
[SSTable](https://cassandra.apache.org/doc/stable/cassandra/architecture/storage-engine.html#sstables)
based file format for quick, indexed key lookups,
storing vital information like file paths, column statistics and schema. This
approach streamlines operations by reducing the necessity for expensive cloud
file listings.
Hudi’s approach of recording updates into Log Files is more efficient and
involves low merge overhead than systems like Hive ACID, where merging all
delta records against
-all Base Files is required. Read more about the various table types in Hudi
[here](./table_types).
+all Base Files is required. Read more about the various table types in Hudi
[here](/docs/table_types).
## Storage Engine
@@ -74,8 +74,8 @@ Cassandra and Clickhouse.

<p align = "center">Figure: Indexes in Hudi</p>
-[Indexes](./indexes) in Hudi enhance query planning, minimizing I/O, speeding
up response times and providing faster writes with low merge costs. The
[metadata table](./metadata/#metadata-table-indices) acts
-as an additional [indexing
system](./metadata#supporting-multi-modal-index-in-hudi) and brings the
benefits of indexes generally to both the readers and writers. Compute engines
can leverage various indexes in the metadata
+[Indexes](/docs/indexes) in Hudi enhance query planning, minimizing I/O,
speeding up response times and providing faster writes with low merge costs.
The [metadata table](/docs/metadata/#metadata-table-indices) acts
+as an additional [indexing
system](/docs/metadata#supporting-multi-modal-index-in-hudi) and brings the
benefits of indexes generally to both the readers and writers. Compute engines
can leverage various indexes in the metadata
table, like file listings, column statistics, bloom filters, record-level
indexes, and [expression
indexes](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md) to
quickly generate optimized query plans and improve read
performance. In addition to the metadata table indexes, Hudi supports simple
join based indexing, bloom filters stored in base file footers, external
key-value stores like HBase,
and optimized storage techniques like bucketing , to efficiently locate File
Groups containing specific record keys. Hudi also provides reader indexes such
as
[expression](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md)
and
@@ -91,12 +91,12 @@ running them in inline, semi-asynchronous or
full-asynchronous modes. Furthermor
asynchronously sharing the underlying executors intelligently with writers.
Let’s take a look at these services.
#### Clustering
-The [clustering](./clustering) service, akin to features in cloud data
warehouses, allows users to group frequently queried records using sort keys or
merge smaller Base Files into
+The [clustering](/docs/clustering) service, akin to features in cloud data
warehouses, allows users to group frequently queried records using sort keys or
merge smaller Base Files into
larger ones for optimal file size management. It's fully integrated with other
timeline actions like cleaning and compaction, enabling smart optimizations
such as avoiding
compaction for File Groups undergoing clustering, thereby saving on I/O.
#### Compaction
-Hudi's [compaction](./compaction) service, featuring strategies like date
partitioning and I/O bounding, merges Base Files with delta logs to create
updated Base Files. It allows
+Hudi's [compaction](/docs/compaction) service, featuring strategies like date
partitioning and I/O bounding, merges Base Files with delta logs to create
updated Base Files. It allows
concurrent writes to the same File Froup, enabled by Hudi's file grouping and
flexible log merging. This facilitates non-blocking execution of deletes even
during concurrent
record updates.
@@ -107,11 +107,11 @@ while also allowing sufficient time for long running
batch jobs (e.g Hive ETLs)
#### Indexing
Hudi's scalable metadata table contains auxiliary data about the table. This
subsystem encompasses various indices, including files, column_stats, and
bloom_filters,
facilitating efficient record location and data skipping. Balancing write
throughput with index updates presents a fundamental challenge, as traditional
indexing methods,
-like locking out writes during indexing, are impractical for large tables due
to lengthy processing times. Hudi addresses this with its innovative
asynchronous [metadata indexing](./metadata_indexing),
+like locking out writes during indexing, are impractical for large tables due
to lengthy processing times. Hudi addresses this with its innovative
asynchronous [metadata indexing](/docs/metadata_indexing),
enabling the creation of various indices without impeding writes. This
approach not only improves write latency but also minimizes resource waste by
reducing contention between writing and indexing activities.
### Concurrency Control
-[Concurrency control](./concurrency_control) defines how different
writers/readers/table services coordinate access to the table. Hudi uses
monotonically increasing time to sequence and order various
+[Concurrency control](/docs/concurrency_control) defines how different
writers/readers/table services coordinate access to the table. Hudi uses
monotonically increasing time to sequence and order various
changes to table state. Much like databases, Hudi take an approach of clearly
differentiating between writers (responsible for upserts/deletes), table
services
(focusing on storage optimization and bookkeeping), and readers (for query
execution). Hudi provides snapshot isolation, offering a consistent view of the
table across
these different operations. It employs lock-free, non-blocking MVCC for
concurrency between writers and table-services, as well as between different
table services, and
@@ -154,12 +154,12 @@ integration with engines written in C/C++.
<p align = "center">Figure: Various platform services in Hudi</p>
Platform services offer functionality that is specific to data and workloads,
and they sit directly on top of the table services, interfacing with writers
and readers.
-Services, like [Hudi Streamer](./hoodie_streaming_ingestion#hudi-streamer) (or
its Flink counterpart), are specialized in handling data and workloads,
seamlessly integrating with Kafka streams and various
+Services, like [Hudi Streamer](/docs/hoodie_streaming_ingestion#hudi-streamer)
(or its Flink counterpart), are specialized in handling data and workloads,
seamlessly integrating with Kafka streams and various
formats to build data lakes. They support functionalities like automatic
checkpoint management, integration with major schema registries (including
Confluent), and
deduplication of data. Hudi Streamer also offers features for backfills,
one-off runs, and continuous mode operation with Spark/Flink streaming writers.
Additionally,
-Hudi provides tools for [snapshotting](./snapshot_exporter) and incrementally
[exporting](./snapshot_exporter#examples) Hudi tables, importing new tables,
and [post-commit callback](platform_services_post_commit_callback) for
analytics or
+Hudi provides tools for [snapshotting](/docs/snapshot_exporter) and
incrementally [exporting](/docs/snapshot_exporter#examples) Hudi tables,
importing new tables, and [post-commit
callback](/docs/platform_services_post_commit_callback) for analytics or
workflow management, enhancing the deployment of production-grade incremental
pipelines. Apart from these services, Hudi also provides broad support for
different
-catalogs such as [Hive Metastore](./syncing_metastore), [AWS
Glue](./syncing_aws_glue_data_catalog/), [Google BigQuery](./gcp_bigquery),
[DataHub](./syncing_datahub), etc. that allows syncing of Hudi tables to be
queried by
+catalogs such as [Hive Metastore](/docs/syncing_metastore), [AWS
Glue](/docs/syncing_aws_glue_data_catalog/), [Google
BigQuery](/docs/gcp_bigquery), [DataHub](/docs/syncing_datahub), etc. that
allows syncing of Hudi tables to be queried by
interactive engines such as Trino and Presto.
### Metaserver*
diff --git a/website/versioned_docs/version-1.0.2/metadata.md
b/website/versioned_docs/version-1.0.2/metadata.md
index 8f3b403112ac..fe8827ebeec5 100644
--- a/website/versioned_docs/version-1.0.2/metadata.md
+++ b/website/versioned_docs/version-1.0.2/metadata.md
@@ -46,7 +46,7 @@ is tracked using internal tables. This approach provides the
following advantage
Following are the different types of metadata currently supported.
-- ***[files
listings](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
+- ***[files
listings](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)***:
Stored as *files* partition in the metadata table. Contains file information
such as file name, size, and active state
for each partition in the data table, along with list of all partitions in
the table. Improves the files listing performance
by avoiding direct storage calls such as *exists, listStatus* and
*listFiles* on the data table.
diff --git a/website/versioned_docs/version-1.0.2/overview.mdx
b/website/versioned_docs/version-1.0.2/overview.mdx
index bb8910f9c7ed..1e55d6916f3a 100644
--- a/website/versioned_docs/version-1.0.2/overview.mdx
+++ b/website/versioned_docs/version-1.0.2/overview.mdx
@@ -25,7 +25,7 @@ but it also allows you to create efficient incremental batch
pipelines. Apache H
Hudi’s advanced performance optimizations, make analytical queries/pipelines
faster with any of the popular query engines including, Apache Spark, Flink,
Presto, Trino, Hive, etc.
Read the docs for more [use case descriptions](/docs/use_cases) and check out
[who's using Hudi](/powered-by), to see how some of the
-largest data lakes in the world including
[Uber](https://eng.uber.com/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
+largest data lakes in the world including
[Uber](https://www.uber.com/en-IN/blog/uber-big-data-platform/),
[Amazon](https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/),
[ByteDance](http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance),
[Robinhood](https://s.apache.org/hudi-robinhood-talk) and more are
transforming their production data lakes with Hudi.
diff --git a/website/versioned_docs/version-1.0.2/s3_hoodie.md
b/website/versioned_docs/version-1.0.2/s3_hoodie.md
index 37f79ae75342..fac2f76d61d2 100644
--- a/website/versioned_docs/version-1.0.2/s3_hoodie.md
+++ b/website/versioned_docs/version-1.0.2/s3_hoodie.md
@@ -88,7 +88,7 @@ AWS glue data libraries are needed if AWS glue data is used
## AWS S3 Versioned Bucket
-With versioned buckets any object deleted creates a [Delete Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html), as Hudi cleans up files using [Cleaner utility](https://hudi.apache.org/docs/hoodie_cleaner) the number of Delete Markers increases over time.
+With versioned buckets any object deleted creates a [Delete Marker](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html), as Hudi cleans up files using [Cleaner utility](/docs/cleaning) the number of Delete Markers increases over time.
It is important to configure the [Lifecycle
Rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
correctly
to clean up these delete markers as the List operation can choke if the number
of delete markers reaches 1000.
We recommend cleaning up Delete Markers after 1 day in Lifecycle Rule.
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.2/sql_queries.md b/website/versioned_docs/version-1.0.2/sql_queries.md
index 6f8c33026e05..d1fa84b9578b 100644
--- a/website/versioned_docs/version-1.0.2/sql_queries.md
+++ b/website/versioned_docs/version-1.0.2/sql_queries.md
@@ -647,7 +647,7 @@ for more details.
## Doris
The Doris integration currently support Copy on Write and Merge On Read tables
in Hudi since version 0.10.0. You can query Hudi tables via Doris from Doris
version 2.0. Doris offers a multi-catalog, which is designed to make it easier
to connect to external data catalogs to enhance Doris's data lake analysis and
federated data query capabilities. Please refer
-to [Doris Hudi Catalog](https://doris.apache.org/docs/lakehouse/datalake-analytics/hudi/) for more details on the setup.
+to [Doris Hudi Catalog](https://doris.apache.org/docs/3.x/lakehouse/catalogs/hudi-catalog) for more details on the setup.
:::note
The current default supported version of Hudi is 0.10.0 ~ 0.13.1, and has not
been tested in other versions. More versions will be supported in the future.
diff --git a/website/versioned_docs/version-1.0.2/structure.md b/website/versioned_docs/version-1.0.2/structure.md
index 137520dd2a54..0e15e353c30a 100644
--- a/website/versioned_docs/version-1.0.2/structure.md
+++ b/website/versioned_docs/version-1.0.2/structure.md
@@ -9,7 +9,7 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tab
* **Read Optimized query** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
* **Incremental query** - Provides a change stream out of the dataset to feed
downstream jobs/ETLs.
- * **Snapshot query** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr))
+ * **Snapshot query** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](https://avro.apache.org/docs/++version++/mapreduce-guide/))
<figure>
<img className="docimage"
src={require("/assets/images/hudi_intro_1.png").default} alt="hudi_intro_1.png"
/>
diff --git a/website/versioned_docs/version-1.0.2/syncing_datahub.md b/website/versioned_docs/version-1.0.2/syncing_datahub.md
index 2a8003a2eec6..28803704c161 100644
--- a/website/versioned_docs/version-1.0.2/syncing_datahub.md
+++ b/website/versioned_docs/version-1.0.2/syncing_datahub.md
@@ -3,7 +3,7 @@ title: DataHub
keywords: [hudi, datahub, sync]
---
-[DataHub](https://datahubproject.io/) is a rich metadata platform that supports features like data discovery, data
+[DataHub](https://datahub.com/) is a rich metadata platform that supports features like data discovery, data
obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
diff --git a/website/versioned_docs/version-1.0.2/table_types.md b/website/versioned_docs/version-1.0.2/table_types.md
index 3b7ec911bfc0..c2ae8baab9eb 100644
--- a/website/versioned_docs/version-1.0.2/table_types.md
+++ b/website/versioned_docs/version-1.0.2/table_types.md
@@ -204,4 +204,4 @@ Refer [here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
* [Comparing Apache Hudi's MOR and COW Tables, Use Cases from
Uber](https://youtu.be/BiTXyzFNHlA)
* [Different table types in Apache Hudi, MOR and COW, Deep
Dive](https://youtu.be/vyEvlt57L-s)
-* [How to Query Hudi Tables in Incremental Fashion and Get only New data on AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQx)
\ No newline at end of file
+* [How to Query Hudi Tables in Incremental Fashion and Get only New data on AWS Glue | Hands on Lab](https://www.youtube.com/watch?v=c6DCJR91rBQ)
\ No newline at end of file
diff --git a/website/versioned_docs/version-1.0.2/troubleshooting.md b/website/versioned_docs/version-1.0.2/troubleshooting.md
index 4696694d41d8..47de1002beae 100644
--- a/website/versioned_docs/version-1.0.2/troubleshooting.md
+++ b/website/versioned_docs/version-1.0.2/troubleshooting.md
@@ -40,7 +40,7 @@ You can increase `hoodie.commits.archival.batch` moving forward to increase the
In addition, you can increase the difference between the 2 watermark
configurations : `hoodie.keep.max.commits` (default : 30)
and `hoodie.keep.min.commits` (default : 20). This way, you can reduce the
number of archive files created and also
at the same time increase the number of metadata archived per archive file.
Note that post 0.7.0 release where we are
-adding consolidated Hudi metadata ([RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)),
+adding consolidated Hudi metadata ([RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331)),
the follow up work would involve re-organizing archival metadata so that we
can do periodic compactions to control
file-sizing of these archive files.
diff --git a/website/versioned_docs/version-1.0.2/tuning-guide.md b/website/versioned_docs/version-1.0.2/tuning-guide.md
index 4a1f72f1b05f..107fa6e67c70 100644
--- a/website/versioned_docs/version-1.0.2/tuning-guide.md
+++ b/website/versioned_docs/version-1.0.2/tuning-guide.md
@@ -57,7 +57,7 @@ When upsert large input data, hudi spills part of input data to disk when reach
### How to tune shuffle parallelism of Hudi jobs ?
-First, let's understand what the term parallelism means in the context of Hudi jobs. For any Hudi job using Spark, parallelism equals to the number of spark partitions that should be generated for a particular stage in the DAG. To understand more about spark partitions, read this [article](https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297). In spark, each spark partition is mapped to a spark task that can be executed on an executor. Typicall [...]
+First, let's understand what the term parallelism means in the context of Hudi jobs. For any Hudi job using Spark, parallelism equals to the number of spark partitions that should be generated for a particular stage in the DAG. To understand more about spark partitions, read this [article](https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297). In spark, each spark partition is mapped to a spark task that can be executed on an executor. Typic [...]
(Spark Application → N Spark Jobs → M Spark Stages → T Spark Tasks) on (E
executors with C cores)