This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4ce0db3b93 [DOCS] update broken links (#5333)
4ce0db3b93 is described below
commit 4ce0db3b93967158b5e854d8230d71a38e221c77
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Mon Apr 18 16:22:51 2022 -0700
[DOCS] update broken links (#5333)
Co-authored-by: Bhavani Sudha Saktheeswaran <[email protected]>
---
website/docs/clustering.md | 20 ++++++++++----------
website/docs/concurrency_control.md | 10 +++++-----
website/docs/deployment.md | 8 ++++----
website/docs/faq.md | 8 ++++----
website/docs/flink-quick-start-guide.md | 2 +-
website/docs/flink_configuration.md | 2 +-
website/docs/hoodie_cleaner.md | 2 +-
website/docs/hoodie_deltastreamer.md | 8 ++++----
website/docs/key_generation.md | 2 +-
website/docs/metrics.md | 10 +++++-----
website/docs/performance.md | 8 ++++----
website/docs/query_engine_setup.md | 2 +-
website/docs/querying_data.md | 8 ++++----
website/docs/quick-start-guide.md | 12 ++++++------
website/docs/use_cases.md | 4 ++--
website/docs/write_operations.md | 2 +-
website/docs/writing_data.md | 22 +++++++++++-----------
17 files changed, 65 insertions(+), 65 deletions(-)
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index f210a15b1b..9e157de785 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -12,7 +12,7 @@ Apache Hudi brings stream processing to big data, providing
fresh data while bei
## Clustering Architecture
-At a high level, Hudi provides different operations such as
insert/upsert/bulk_insert through it’s write client API to be able to write
data to a Hudi table. To be able to choose a trade-off between file size and
ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be
able to configure the smallest allowable file size. Users are able to configure
the small file [soft
limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to
`0` to force new data [...]
+At a high level, Hudi provides different operations such as
insert/upsert/bulk_insert through its write client API to be able to write
data to a Hudi table. To be able to choose a trade-off between file size and
ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be
able to configure the smallest allowable file size. Users are able to configure
the small file [soft
limit](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit)
to `0` to force new [...]
@@ -95,12 +95,12 @@ broadly classified into three types: clustering plan
strategy, execution strateg
This strategy comes into play while creating clustering plan. It helps to
decide what file groups should be clustered.
Let's look at different plan strategies that are available with Hudi. Note
that these strategies are easily pluggable
-using this
[config](/docs/next/configurations#hoodieclusteringplanstrategyclass).
+using this [config](/docs/configurations#hoodieclusteringplanstrategyclass).
1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
- the [small file
limit](/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+ the [small file
limit](/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit)
of base files and creates clustering groups upto max file size allowed per
group. The max size can be specified using
- this
[config](/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup).
This
+ this
[config](/docs/configurations/#hoodieclusteringplanstrategymaxbytespergroup).
This
strategy is useful for stitching together medium-sized files into larger
ones to reduce lot of files spread across
cold partitions.
2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days
partitions and creates a plan that will
@@ -122,12 +122,12 @@ All the strategies are partition-aware and the latter two
are still bound by the
### Execution Strategy
After building the clustering groups in the planning phase, Hudi applies
execution strategy, for each group, primarily
-based on sort columns and size. The strategy can be specified using this
[config](/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+based on sort columns and size. The strategy can be specified using this
[config](/docs/configurations/#hoodieclusteringexecutionstrategyclass).
`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify
the columns to sort the data by, when
clustering using
-this
[config](/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
Apart from
-that, we can also set [max file
size](/docs/next/configurations/#hoodieparquetmaxfilesize)
+this [config](/docs/configurations/#hoodieclusteringplanstrategysortcolumns).
Apart from
+that, we can also set [max file
size](/docs/configurations/#hoodieparquetmaxfilesize)
for the parquet files produced due to clustering. The strategy uses bulk
insert to write data into new files, in which
case, Hudi implicitly uses a partitioner that does sorting based on specified
columns. In this way, the strategy changes
the data layout in a way that not only improves query performance but also
balance rewrite overhead automatically.
@@ -135,19 +135,19 @@ the data layout in a way that not only improves query
performance but also balan
Now this strategy can be executed either as a single spark job or multiple
jobs depending on number of clustering groups
created in the planning phase. By default, Hudi will submit multiple spark
jobs and union the results. In case you want
to force Hudi to use single spark job, set the execution strategy
-class
[config](/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+class [config](/docs/configurations/#hoodieclusteringexecutionstrategyclass)
to `SingleSparkJobExecutionStrategy`.
### Update Strategy
Currently, clustering can only be scheduled for tables/partitions not
receiving any concurrent updates. By default,
-the [config for update
strategy](/docs/next/configurations/#hoodieclusteringupdatesstrategy) is
+the [config for update
strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is
set to ***SparkRejectUpdateStrategy***. If some file group has updates during
clustering then it will reject updates and
throw an exception. However, in some use-cases updates are very sparse and do
not touch most file groups. The default
strategy to simply reject updates does not seem fair. In such use-cases, users
can set the config to ***SparkAllowUpdateStrategy***.
We discussed the critical strategy configurations. All other configurations
related to clustering are
-listed [here](/docs/next/configurations/#Clustering-Configs). Out of this
list, a few
+listed [here](/docs/configurations/#Clustering-Configs). Out of this list, a
few
configurations that will be very useful are:
| Config key | Remarks | Default |
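The clustering knobs touched by this hunk can be gathered into one options map before being passed to a writer. A minimal, hypothetical sketch follows: the key names are derived from the configuration-page anchors in the links above, while every value (and the `SparkAllowUpdateStrategy` class path) is purely illustrative, not a recommendation.

```python
# Illustrative only: clustering-related Hudi write options discussed above.
# Key names follow the /docs/configurations anchors; values are made-up examples.
clustering_opts = {
    # Soft limit of 0 forces new data into new file groups (per the text above).
    "hoodie.parquet.small.file.limit": "0",
    # Base-file size below which a file slice is a clustering candidate.
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 * 1024),
    # Cap on total bytes per clustering group.
    "hoodie.clustering.plan.strategy.max.bytes.per.group": str(2 * 1024 * 1024 * 1024),
    # Columns the execution strategy sorts by when rewriting files.
    "hoodie.clustering.plan.strategy.sort.columns": "region,city",
    # Hypothetical fully-qualified class path for the permissive update strategy.
    "hoodie.clustering.updates.strategy":
        "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy",
}

# These would typically be supplied as .option(key, value) pairs
# on a Spark datasource write.
for key, value in clustering_opts.items():
    print(f"{key}={value}")
```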
diff --git a/website/docs/concurrency_control.md
b/website/docs/concurrency_control.md
index a9a0d5860c..e71cb4a8f2 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -19,13 +19,13 @@ between multiple table service writers and readers.
Additionally, using MVCC, Hu
the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits
(or writers) happening to the same table, if they do not have writes to
overlapping files being changed, both writers are allowed to succeed.
This feature is currently *experimental* and requires either Zookeeper or
HiveMetastore to acquire locks.
-It may be helpful to understand the different guarantees provided by [write
operations](/docs/writing_data#write-operations) via Hudi datasource or the
delta streamer.
+It may be helpful to understand the different guarantees provided by [write
operations](/docs/write_operations/) via Hudi datasource or the delta streamer.
## Single Writer Guarantees
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if
[dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if
[dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+ - *INSERT Guarantee*: The target table will NEVER have duplicates if
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is
enabled.
+ - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is
enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER
out of order.
## Multi Writer Guarantees
@@ -33,8 +33,8 @@ It may be helpful to understand the different guarantees
provided by [write oper
With multiple writers using OCC, some of the above guarantees change as follows
- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if
[dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if
[dedup](/docs/configurations#INSERT_DROP_DUPS_OPT_KEY) is enabled.
+- *INSERT Guarantee*: The target table MIGHT have duplicates even if
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is
enabled.
+- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is
enabled.
- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints MIGHT be out
of order due to multiple writer jobs finishing at different times.
## Enabling Multi Writing
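As a concrete illustration of the multi-writer guarantees above, OCC is enabled through write-concurrency properties; a hedged sketch, assuming the Zookeeper-based lock provider mentioned earlier (key names as in the Hudi configuration docs; the host, port, and path values are placeholders):

```properties
# Sketch only: enable optimistic concurrency control with Zookeeper locks.
# Property names per the Hudi configuration reference; values are placeholders.
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi_locks
```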
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index a33c30a951..739480205d 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -25,9 +25,9 @@ With Merge_On_Read Table, Hudi ingestion needs to also take
care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/writing_data#deltastreamer) is the standalone utility to
incrementally pull upstream changes from varied sources such as DFS, Kafka and
DB Changelogs and ingest them to hudi tables. It runs as a spark application in
2 modes.
+[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone
utility to incrementally pull upstream changes from varied sources such as DFS,
Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark
application in 2 modes.
- - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run
mode and writing to Merge On Read table type in a yarn cluster.
@@ -126,7 +126,7 @@ Here is an example invocation for reading from kafka topic
in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Writing Data](/docs/writing_data#datasource-writer), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
+As described in [Writing Data](/docs/writing_data#spark-datasource-writer),
you can use spark datasource to ingest to hudi table. This mechanism allows you
to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also
supports spark streaming to ingest a streaming source to Hudi table. For Merge
On Read table types, inline compaction is turned on by default which runs after
every ingestion run. The compaction frequency can be changed by setting the
property "hoodie.compact.inl [...]
Here is an example invocation using spark datasource
@@ -144,7 +144,7 @@ inputDF.write()
## Upgrading
-New Hudi releases are listed on the [releases page](/releases), with detailed
notes which list all the changes, with highlights in each release.
+New Hudi releases are listed on the [releases page](/releases/download), with
detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of
responsibilities, which we take seriously.
As general guidelines,
diff --git a/website/docs/faq.md b/website/docs/faq.md
index c675788561..cee9e583e5 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -83,7 +83,7 @@ At a high level, Hudi is based on MVCC design that writes
data to versioned parq
### What are some ways to write a Hudi dataset?
-Typically, you obtain a set of partial updates/inserts from your source and
issue [write operations](https://hudi.apache.org/docs/writing_data/) against a
Hudi dataset. If you ingesting data from any of the standard sources like
Kafka, or tailing DFS, the [delta
streamer](https://hudi.apache.org/docs/writing_data/#deltastreamer) tool is
invaluable and provides an easy, self-managed solution to getting data written
into Hudi. You can also write your own code to capture data from a custom [...]
+Typically, you obtain a set of partial updates/inserts from your source and
issue [write operations](https://hudi.apache.org/docs/write_operations/)
against a Hudi dataset. If you are ingesting data from any of the standard
sources like Kafka, or tailing DFS, the [delta
streamer](https://hudi.apache.org/docs/hoodie_deltastreamer#deltastreamer) tool
is invaluable and provides an easy, self-managed solution to getting data
written into Hudi. You can also write your own code to capture data fr [...]
### How is a Hudi job deployed?
@@ -225,7 +225,7 @@ set
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
### Can I register my Hudi dataset with Apache Hive metastore?
-Yes. This can be performed either via the standalone [Hive Sync
tool](https://hudi.apache.org/docs/writing_data/#syncing-to-hive) or using
options in
[deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50)
tool or
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
+Yes. This can be performed either via the standalone [Hive Sync
tool](https://hudi.apache.org/docs/syncing_metastore#hive-sync-tool) or using
options in
[deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50)
tool or
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
### How does the Hudi indexing work & what are its benefits?
@@ -255,7 +255,7 @@ That said, for obvious reasons of not blocking ingesting
for compaction, you may
### What performance/ingest latency can I expect for Hudi writing?
-The speed at which you can write into Hudi depends on the [write
operation](https://hudi.apache.org/docs/writing_data/) and some trade-offs you
make along the way like file sizing. Just like how databases incur overhead
over direct/raw file I/O on disks, Hudi operations may have overhead from
supporting database like features compared to reading/writing raw DFS files.
That said, Hudi implements advanced techniques from database literature to keep
these minimal. User is encouraged to ha [...]
+The speed at which you can write into Hudi depends on the [write
operation](https://hudi.apache.org/docs/write_operations) and some trade-offs
you make along the way like file sizing. Just like how databases incur overhead
over direct/raw file I/O on disks, Hudi operations may have overhead from
supporting database like features compared to reading/writing raw DFS files.
That said, Hudi implements advanced techniques from database literature to keep
these minimal. User is encouraged to [...]
| Storage Type | Type of workload | Performance | Tips |
|-------|--------|--------|--------|
@@ -364,7 +364,7 @@ spark.read.parquet("your_data_set/path/to/month").limit(n)
// Limit n records
.save(basePath);
```
-For merge on read table, you may want to also try scheduling and running
compaction jobs. You can run compaction directly using spark submit on
org.apache.hudi.utilities.HoodieCompactor or by using [HUDI
CLI](https://hudi.apache.org/docs/deployment/#cli).
+For merge on read table, you may want to also try scheduling and running
compaction jobs. You can run compaction directly using spark submit on
org.apache.hudi.utilities.HoodieCompactor or by using [HUDI
CLI](https://hudi.apache.org/docs/cli).
### If I keep my file versions at 1, with this configuration will i be able to
do a roll back (to the last commit) when write fail?
diff --git a/website/docs/flink-quick-start-guide.md
b/website/docs/flink-quick-start-guide.md
index a723b8ed7b..daec4ba0b5 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -31,7 +31,7 @@ Start a standalone Flink cluster within hadoop environment.
Before you start up the cluster, we suggest to config the cluster as follows:
- in `$FLINK_HOME/conf/flink-conf.yaml`, add config option
`taskmanager.numberOfTaskSlots: 4`
-- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations
according to the characteristics of your task](#flink-configuration)
+- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations
according to the characteristics of your
task](flink_configuration#global-configurations)
- in `$FLINK_HOME/conf/workers`, add item `localhost` as 4 lines so that there
are 4 workers on the local cluster
Now starts the cluster:
diff --git a/website/docs/flink_configuration.md
b/website/docs/flink_configuration.md
index ba7853d7cd..d615281a6b 100644
--- a/website/docs/flink_configuration.md
+++ b/website/docs/flink_configuration.md
@@ -60,7 +60,7 @@ allocated with enough memory, we can try to set these memory
options.
| `write.bucket_assign.tasks` | The parallelism of bucket assigner
operators. No default value, using Flink `parallelism.default` |
[`parallelism.default`](#parallelism) | Increases the parallelism also
increases the number of buckets, thus the number of small files (small buckets)
|
| `write.index_boostrap.tasks` | The parallelism of index bootstrap.
Increasing parallelism can speed up the efficiency of the bootstrap stage. The
bootstrap stage will block checkpointing. Therefore, it is necessary to set
more checkpoint failure tolerance times. Default using Flink
`parallelism.default` | [`parallelism.default`](#parallelism) | It only take
effect when `index.bootsrap.enabled` is `true` |
| `read.tasks` | The parallelism of read operators (batch and stream). Default
`4` | `4` | |
-| `compaction.tasks` | The parallelism of online compaction. Default `4` | `4`
| `Online compaction` will occupy the resources of the write task. It is
recommended to use [`offline compaction`](#offline-compaction) |
+| `compaction.tasks` | The parallelism of online compaction. Default `4` | `4`
| `Online compaction` will occupy the resources of the write task. It is
recommended to use [`offline
compaction`](/docs/compaction/#flink-offline-compaction) |
### Compaction
diff --git a/website/docs/hoodie_cleaner.md b/website/docs/hoodie_cleaner.md
index 41956f566c..10f1aa2450 100644
--- a/website/docs/hoodie_cleaner.md
+++ b/website/docs/hoodie_cleaner.md
@@ -47,7 +47,7 @@ hoodie.clean.async=true
```
### CLI
-You can also use [Hudi CLI](https://hudi.apache.org/docs/deployment#cli) to
run Hoodie Cleaner.
+You can also use [Hudi CLI](/docs/cli) to run Hoodie Cleaner.
CLI provides the below commands for cleaner service:
- `cleans show`
diff --git a/website/docs/hoodie_deltastreamer.md
b/website/docs/hoodie_deltastreamer.md
index f212f57859..3c49bd2bbf 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -374,7 +374,7 @@ frequent `file handle` switching.
:::note
The parallelism of `bulk_insert` is specified by `write.tasks`. The
parallelism will affect the number of small files.
In theory, the parallelism of `bulk_insert` is the number of `bucket`s (In
particular, when each bucket writes to maximum file size, it
-will rollover to the new file handle. Finally, `the number of files` >=
[`write.bucket_assign.tasks`](#parallelism)).
+will rollover to the new file handle. Finally, `the number of files` >=
[`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks).
:::
#### Options
@@ -382,9 +382,9 @@ will rollover to the new file handle. Finally, `the number
of files` >= [`write.
| Option Name | Required | Default | Remarks |
| ----------- | ------- | ------- | ------- |
| `write.operation` | `true` | `upsert` | Setting as `bulk_insert` to open
this function |
-| `write.tasks` | `false` | `4` | The parallelism of `bulk_insert`, `the
number of files` >= [`write.bucket_assign.tasks`](#parallelism) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to
shuffle data according to the partition field before writing. Enabling this
option will reduce the number of small files, but there may be a risk of data
skew |
-| `write.bulk_insert.sort_by_partition` | `false` | `true` | Whether to sort
data according to the partition field before writing. Enabling this option will
reduce the number of small files when a write task writes multiple partitions |
+| `write.tasks` | `false` | `4` | The parallelism of `bulk_insert`, `the
number of files` >=
[`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
+| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to
shuffle data according to the partition field before writing. Enabling this
option will reduce the number of small files, but there may be a risk of data
skew |
+| `write.bulk_insert.sort_by_partition` | `false` | `true` | Whether to sort
data according to the partition field before writing. Enabling this option will
reduce the number of small files when a write task writes multiple partitions |
| `write.sort.memory` | `false` | `128` | Available managed memory of sort
operator. default `128` MB |
### Index Bootstrap
diff --git a/website/docs/key_generation.md b/website/docs/key_generation.md
index f20e4d77a1..1dcb020645 100644
--- a/website/docs/key_generation.md
+++ b/website/docs/key_generation.md
@@ -17,7 +17,7 @@ Hudi provides several key generators out of the box that
users can use based on
implementation for users to implement and use their own KeyGenerator. This
page goes over all different types of key
generators that are readily available to use.
-[Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
+[Here](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
is the interface for KeyGenerator in Hudi for your reference.
Before diving into different types of key generators, let’s go over some of
the common configs required to be set for
diff --git a/website/docs/metrics.md b/website/docs/metrics.md
index 17441447fa..4a831d7981 100644
--- a/website/docs/metrics.md
+++ b/website/docs/metrics.md
@@ -6,7 +6,7 @@ toc: true
last_modified_at: 2020-06-20T15:59:57-04:00
---
-In this section, we will introduce the `MetricsReporter` and `HoodieMetrics`
in Hudi. You can view the metrics-related configurations
[here](configurations#metrics-configs).
+In this section, we will introduce the `MetricsReporter` and `HoodieMetrics`
in Hudi. You can view the metrics-related configurations
[here](configurations#METRICS).
## MetricsReporter
@@ -17,7 +17,7 @@ MetricsReporter provides APIs for reporting `HoodieMetrics`
to user-specified ba
JmxMetricsReporter is an implementation of JMX reporter, which used to report
JMX metrics.
#### Configurations
-The following is an example of `JmxMetricsReporter`. More detaile
configurations can be referenced [here](configurations#jmx).
+The following is an example of `JmxMetricsReporter`. More detailed
configurations can be referenced
[here](configurations#Metrics-Configurations-for-Jmx).
```properties
hoodie.metrics.on=true
@@ -37,7 +37,7 @@ As configured above, JmxMetricsReporter will started JMX
server on port 4001. We
MetricsGraphiteReporter is an implementation of Graphite reporter, which
connects to a Graphite server, and send `HoodieMetrics` to it.
#### Configurations
-The following is an example of `MetricsGraphiteReporter`. More detaile
configurations can be referenced [here](configurations#graphite).
+The following is an example of `MetricsGraphiteReporter`. More detailed
configurations can be referenced
[here](configurations#Metrics-Configurations-for-Graphite).
```properties
hoodie.metrics.on=true
@@ -58,7 +58,7 @@ DatadogMetricsReporter is an implementation of Datadog
reporter.
A reporter which publishes metric values to Datadog monitoring service via
Datadog HTTP API.
#### Configurations
-The following is an example of `DatadogMetricsReporter`. More detailed
configurations can be referenced [here](configurations#datadog).
+The following is an example of `DatadogMetricsReporter`. More detailed
configurations can be referenced
[here](configurations#Metrics-Configurations-for-Datadog-reporter).
```properties
hoodie.metrics.on=true
@@ -138,7 +138,7 @@ tuned are in the `HoodieMetricsCloudWatchConfig` class.
Allows users to define a custom metrics reporter.
#### Configurations
-The following is an example of `UserDefinedMetricsReporter`. More detailed
configurations can be referenced [here](configurations#user-defined-reporter).
+The following is an example of `UserDefinedMetricsReporter`. More detailed
configurations can be referenced [here](configurations#Metrics-Configurations).
```properties
hoodie.metrics.on=true
diff --git a/website/docs/performance.md b/website/docs/performance.md
index 53152730bd..db78a7f25b 100644
--- a/website/docs/performance.md
+++ b/website/docs/performance.md
@@ -14,12 +14,12 @@ column statistics etc. Even on some cloud data stores,
there is often cost to li
Here are some ways to efficiently manage the storage of your Hudi tables.
-- The [small file handling
feature](/docs/configurations#compactionSmallFileSize) in Hudi, profiles
incoming workload
+- The [small file handling
feature](/docs/configurations/#hoodieparquetsmallfilelimit) in Hudi, profiles
incoming workload
and distributes inserts to existing file groups instead of creating new file
groups, which can lead to small files.
-- Cleaner can be [configured](/docs/configurations#retainCommits) to clean up
older file slices, more or less aggressively depending on maximum time for
queries to run & lookback needed for incremental pull
-- User can also tune the size of the [base/parquet
file](/docs/configurations#limitFileSize), [log
files](/docs/configurations#logFileMaxSize) & expected [compression
ratio](/docs/configurations#parquetCompressionRatio),
+- Cleaner can be
[configured](/docs/configurations#hoodiecleanercommitsretained) to clean up
older file slices, more or less aggressively depending on maximum time for
queries to run & lookback needed for incremental pull
+- User can also tune the size of the [base/parquet
file](/docs/configurations#hoodieparquetmaxfilesize), [log
files](/docs/configurations#hoodielogfilemaxsize) & expected [compression
ratio](/docs/configurations#hoodieparquetcompressionratio),
such that sufficient number of inserts are grouped into the same file group,
resulting in well sized base files ultimately.
-- Intelligently tuning the [bulk insert
parallelism](/docs/configurations#withBulkInsertParallelism), can again in
nicely sized initial file groups. It is in fact critical to get this right,
since the file groups
+- Intelligently tuning the [bulk insert
parallelism](/docs/configurations#hoodiebulkinsertshuffleparallelism) can
again result in nicely sized initial file groups. It is in fact critical to get
this right, since the file groups
once created cannot be deleted, but simply expanded as explained before.
- For workloads with heavy updates, the [merge-on-read
table](/docs/concepts#merge-on-read-table) provides a nice mechanism for
ingesting quickly into smaller files and then later merging them into larger
base files via compaction.
diff --git a/website/docs/query_engine_setup.md
b/website/docs/query_engine_setup.md
index d89a96d042..8d555dae3e 100644
--- a/website/docs/query_engine_setup.md
+++ b/website/docs/query_engine_setup.md
@@ -64,7 +64,7 @@ To query Hudi tables on Trino, please place the
`hudi-presto-bundle` jar into th
## Hive
In order for Hive to recognize Hudi tables and query correctly,
-- the HiveServer2 needs to be provided with the
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf#concept_nc3_mms_lr).
This will ensure the input format
+- the HiveServer2 needs to be provided with the
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
This will ensure the input format
classes with its dependencies are available for query planning & execution.
- For MERGE_ON_READ tables, additionally the bundle needs to be put on the
hadoop/hive installation across the cluster, so that queries can pick up the
custom RecordReader as well.
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index c516708e7d..1b5cee0d5b 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -49,7 +49,7 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon,
begin_lat, ts from hu
```
For examples, refer to [Incremental
Queries](/docs/quick-start-guide#incremental-query) in the Spark quickstart.
-Please refer to [configurations](/docs/configurations#spark-datasource)
section, to view all datasource options.
+Please refer to [configurations](/docs/configurations#SPARK_DATASOURCE)
section, to view all datasource options.
Additionally, `HoodieReadClient` offers the following functionality using
Hudi's implicit indexing.
@@ -170,16 +170,16 @@ would ensure Map Reduce execution is chosen for a Hive
query, which combines par
separated) and calls InputFormat.listStatus() only once with all those
partitions.
## PrestoDB
-To setup PrestoDB for querying Hudi, see the [Query Engine
Setup](/docs/query_engine_setup#PrestoDB) page.
+To setup PrestoDB for querying Hudi, see the [Query Engine
Setup](/docs/query_engine_setup#prestodb) page.
## Trino
-To setup Trino for querying Hudi, see the [Query Engine
Setup](/docs/query_engine_setup#Trino) page.
+To setup Trino for querying Hudi, see the [Query Engine
Setup](/docs/query_engine_setup#trino) page.
## Impala (3.4 or later)
### Snapshot Query
-Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables#external_tables) on HDFS.
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) on HDFS.
To create a Hudi read optimized table on Impala:
```
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 6446016254..51d0b838f7 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -412,12 +412,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists.
You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -453,7 +453,7 @@ You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<
[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -1117,7 +1117,7 @@ more details please refer to [procedures](/docs/next/procedures).
You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source),
and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-scala-212)
+instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
for more info.
Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and
diff --git a/website/docs/use_cases.md b/website/docs/use_cases.md
index 3758d7208e..f3fabdf04d 100644
--- a/website/docs/use_cases.md
+++ b/website/docs/use_cases.md
@@ -15,7 +15,7 @@ This blog post outlines this use case in more depth - https://hudi.apache.org/bl
### Near Real-Time Ingestion
-Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake) is a common problem,
+Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake.html) is a common problem,
that is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools. This "raw data" layer of the data lake often forms the bedrock on which
more value is created.
@@ -27,7 +27,7 @@ even moderately big installations store billions of rows. It goes without saying
are needed if ingestion is to keep up with the typically high update volumes.
Even for immutable data sources like [Kafka](https://kafka.apache.org), there is often a need to de-duplicate the incoming events against what's stored on DFS.
-Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
+Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
All of this is seamlessly achieved by the Hudi DeltaStreamer tool, which is maintained in tight integration with rest of the code
and we are always trying to add more capture sources, to make this easier for the users. The tool also has a continuous mode, where it
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index ccdac23350..746a93d057 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -37,7 +37,7 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b
## Writing path
The following is an inside look on the Hudi write path and the sequence of events that occur during a write.
-1. [Deduping](/docs/configurations/#writeinsertdeduplicate)
+1. [Deduping](/docs/configurations#hoodiecombinebeforeinsert)
1. First your input records may have duplicate keys within the same batch and duplicates need to be combined or reduced by key.
2. [Index Lookup](/docs/next/indexing)
1. Next, an index lookup is performed to try and match the input records to identify which file groups they belong to.
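
The deduping step described above (combine records sharing a key, keeping the one with the largest precombine value) can be sketched roughly as follows. This is a hedged illustration of the semantics only, not Hudi's actual code path; the field names `uuid` and `ts` follow the quickstart schema:

```python
# Hypothetical sketch of the "combine before insert" dedup step: within
# one batch, keep only the record with the largest precombine field
# (e.g. `ts`) per record key. Illustrative only, not Hudi internals.
def dedup_batch(records, key_field="uuid", precombine_field="ts"):
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[precombine_field] > latest[key][precombine_field]:
            latest[key] = rec
    return list(latest.values())

batch = [
    {"uuid": "a", "ts": 1, "fare": 10.0},
    {"uuid": "a", "ts": 3, "fare": 12.5},  # duplicate key; newer ts wins
    {"uuid": "b", "ts": 2, "fare": 7.0},
]
deduped = dedup_batch(batch)  # two records survive: latest "a" and "b"
```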
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index 15fcc4d66b..8765222b21 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -9,7 +9,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
In this section, we will cover ways to ingest new changes from external sources or even other Hudi tables.
-The two main tools available are the [DeltaStreamer](#deltastreamer) tool, as well as the [Spark Hudi datasource](#datasource-writer).
+The two main tools available are the [DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) tool, as well as the [Spark Hudi datasource](#spark-datasource-writer).
## Spark Datasource Writer
@@ -31,7 +31,7 @@ Default value: `"partitionpath"`<br/>
**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records within the same batch have the same key value, the record with the largest value from the field specified will be chosen. If you are using the default payload of OverwriteWithLatestAvroPayload for HoodieRecordPayload (`WRITE_PAYLOAD_CLASS`), an incoming record will always take precedence over the one in storage, ignoring this `PRECOMBINE_FIELD_OPT_KEY`. <br/>
Default value: `"ts"`<br/>
-**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use.<br/>
+**OPERATION_OPT_KEY**: The [write operations](/docs/write_operations) to use.<br/>
Available values:<br/>
`UPSERT_OPERATION_OPT_VAL` (default), `BULK_INSERT_OPERATION_OPT_VAL`, `INSERT_OPERATION_OPT_VAL`, `DELETE_OPERATION_OPT_VAL`
@@ -39,7 +39,7 @@ Available values:<br/>
Available values:<br/>
[`COW_TABLE_TYPE_OPT_VAL`](/docs/concepts#copy-on-write-table) (default), [`MOR_TABLE_TYPE_OPT_VAL`](/docs/concepts#merge-on-read-table)
-**KEYGENERATOR_CLASS_OPT_KEY**: Refer to [Key Generation](#key-generation) section below.
+**KEYGENERATOR_CLASS_OPT_KEY**: Refer to [Key Generation](/docs/key_generation) section below.
**HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY**: If using hive, specify if the table should or should not be partitioned.<br/>
Available values:<br/>
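
Taken together, the write options discussed above are typically supplied as a map to the Spark datasource. A hedged sketch follows; the string config keys mirror the `OPT_KEY` constants described above (spellings assumed for illustration against a 0.x release — verify against the configurations page for your Hudi version), and the commented-out `df.write` call is Spark-side and not executed here:

```python
# Hypothetical option map for a Spark datasource write. Each string key
# corresponds to one of the OPT_KEY options documented above.
hudi_options = {
    "hoodie.table.name": "hudi_trips_cow",
    "hoodie.datasource.write.recordkey.field": "uuid",         # RECORDKEY_FIELD_OPT_KEY
    "hoodie.datasource.write.partitionpath.field": "partitionpath",
    "hoodie.datasource.write.precombine.field": "ts",          # PRECOMBINE_FIELD_OPT_KEY
    "hoodie.datasource.write.operation": "upsert",             # OPERATION_OPT_KEY
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",     # TABLE_TYPE_OPT_KEY
}
# In a Spark session this would be passed as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```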
@@ -88,12 +88,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists.
You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/6f9b02decb5bb2b83709b1b6ec04a97e4d102c11/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>
@@ -124,12 +124,12 @@ df.write.format("hudi").
:::info
`mode(Overwrite)` overwrites and recreates the table if it already exists.
You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key
-(`uuid` in [schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/country/city`) and combine logic (`ts` in
-[schema](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to
-[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+(`uuid` in [schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)), partition field (`region/country/city`) and combine logic (`ts` in
+[schema](https://github.com/apache/hudi/blob/2e6e302efec2fa848ded4f88a95540ad2adb7798/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L60)) to ensure trip records are unique within each partition. For more info, refer to
+[Modeling data stored in Hudi](https://hudi.apache.org/learn/faq/#how-do-i-model-the-data-stored-in-hudi)
and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/writing_data).
Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue
-`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/writing_data#write-operations)
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/write_operations)
:::
</TabItem>