This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8bb5c89239f4 chore: update applicable jira links to gh issues (#17470)
8bb5c89239f4 is described below
commit 8bb5c89239f4fa76ab33e46727766ac39aeb85a4
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Dec 3 13:27:23 2025 -0600
chore: update applicable jira links to gh issues (#17470)
---
website/.markdownlint.json | 1 +
website/blog/2021-08-23-s3-events-source.md | 2 +-
website/docs/comparison.md | 4 +-
website/docs/hudi_stack.md | 4 +-
website/docs/metadata.md | 4 +-
website/docs/metadata_indexing.md | 5 +-
website/docs/sql_queries.md | 5 +-
website/releases/release-0.14.0.md | 4 +-
website/releases/release-1.0.0-beta1.md | 5 +-
website/releases/release-1.0.0-beta2.md | 2 +-
website/releases/release-1.0.0.md | 3 +-
website/src/pages/faq/writing_tables.md | 4 +-
website/src/pages/roadmap.md | 66 +++++++++++-----------
.../versioned_docs/version-0.14.0/comparison.md | 2 +-
website/versioned_docs/version-0.14.0/faq.md | 4 +-
.../version-0.14.0/metadata_indexing.md | 2 +-
.../versioned_docs/version-0.14.0/sql_queries.md | 5 +-
.../versioned_docs/version-0.14.1/comparison.md | 2 +-
.../version-0.14.1/faq_writing_tables.md | 4 +-
.../version-0.14.1/metadata_indexing.md | 2 +-
.../versioned_docs/version-0.14.1/sql_queries.md | 5 +-
.../versioned_docs/version-0.15.0/comparison.md | 2 +-
.../version-0.15.0/faq_writing_tables.md | 4 +-
.../versioned_docs/version-0.15.0/hudi_stack.md | 4 +-
website/versioned_docs/version-0.15.0/metadata.md | 2 +-
.../version-0.15.0/metadata_indexing.md | 2 +-
.../versioned_docs/version-0.15.0/sql_queries.md | 6 +-
website/versioned_docs/version-1.0.0/comparison.md | 2 +-
.../version-1.0.0/faq_writing_tables.md | 4 +-
website/versioned_docs/version-1.0.0/hudi_stack.md | 4 +-
website/versioned_docs/version-1.0.0/metadata.md | 2 +-
.../version-1.0.0/metadata_indexing.md | 2 +-
.../versioned_docs/version-1.0.0/sql_queries.md | 6 +-
website/versioned_docs/version-1.0.1/comparison.md | 2 +-
.../version-1.0.1/faq_writing_tables.md | 4 +-
website/versioned_docs/version-1.0.1/hudi_stack.md | 4 +-
website/versioned_docs/version-1.0.1/metadata.md | 2 +-
.../version-1.0.1/metadata_indexing.md | 2 +-
.../versioned_docs/version-1.0.1/sql_queries.md | 2 +-
website/versioned_docs/version-1.0.2/comparison.md | 2 +-
.../version-1.0.2/faq_writing_tables.md | 4 +-
website/versioned_docs/version-1.0.2/hudi_stack.md | 4 +-
website/versioned_docs/version-1.0.2/metadata.md | 2 +-
.../version-1.0.2/metadata_indexing.md | 2 +-
.../versioned_docs/version-1.0.2/sql_queries.md | 2 +-
website/versioned_docs/version-1.1.0/comparison.md | 2 +-
website/versioned_docs/version-1.1.0/hudi_stack.md | 4 +-
website/versioned_docs/version-1.1.0/metadata.md | 2 +-
.../version-1.1.0/metadata_indexing.md | 2 +-
.../versioned_docs/version-1.1.0/sql_queries.md | 2 +-
50 files changed, 104 insertions(+), 115 deletions(-)
diff --git a/website/.markdownlint.json b/website/.markdownlint.json
index efe0e97b5213..3d9121c6ee43 100644
--- a/website/.markdownlint.json
+++ b/website/.markdownlint.json
@@ -6,5 +6,6 @@
"style": "ordered"
},
"MD036": false,
+ "MD040": false,
"MD041": false
}
diff --git a/website/blog/2021-08-23-s3-events-source.md
b/website/blog/2021-08-23-s3-events-source.md
index 79989ef9deda..682543aa3d72 100644
--- a/website/blog/2021-08-23-s3-events-source.md
+++ b/website/blog/2021-08-23-s3-events-source.md
@@ -116,7 +116,7 @@ This post introduced a log-based approach to ingest data
from S3 into Hudi table
- Another stream of work is to add resource manager that allows users to setup
notifications and delete resources when no longer needed.
- Another interesting piece of work is to support **asynchronous backfills**.
Notification systems are eventually consistent and typically do not guarantee
perfect delivery of all files right away. The log-based approach provides
enough flexibility to trigger automatic backfills at a configurable interval
e.g. once a day or once a week.
-Please follow this [JIRA](https://issues.apache.org/jira/browse/HUDI-1896) to
learn more about active development on this issue.
+Please follow this [GitHub issue](https://github.com/apache/hudi/issues/14794)
to learn more about active development on this issue.
We look forward to contributions from the community. Hope you enjoyed this
post.
Put your Hudi on and keep streaming!
diff --git a/website/docs/comparison.md b/website/docs/comparison.md
index 0bcce2ace532..3d2ef46a77e6 100644
--- a/website/docs/comparison.md
+++ b/website/docs/comparison.md
@@ -14,7 +14,6 @@ and bring out the different tradeoffs these systems have
accepted in their desig
class support for `upserts`. A key differentiator is that Kudu also attempts
to serve as a datastore for OLTP workloads, something that Hudi does not aspire
to be.
Consequently, Kudu does not support incremental pulling (as of early 2017),
something Hudi does to enable incremental processing use cases.
-
Kudu diverges from a distributed file system abstraction and HDFS altogether,
with its own set of storage servers talking to each other via RAFT.
Hudi, on the other hand, is designed to work with an underlying Hadoop
compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of
storage servers,
instead relying on Apache Spark to do the heavy-lifting. Thus, Hudi can be
scaled easily, just like other Spark jobs, while Kudu would require hardware
@@ -22,7 +21,6 @@ instead relying on Apache Spark to do the heavy-lifting.
Thus, Hudi can be scale
But, if we were to go with results shared by
[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
,
we expect Hudi to be positioned as something that ingests parquet with superior
performance.
-
## Hive Transactions
[Hive
Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)
is another similar effort, which tries to implement storage like
@@ -53,4 +51,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/docs/hudi_stack.md b/website/docs/hudi_stack.md
index 67e2bdbb1b8f..fbc3a05e7930 100644
--- a/website/docs/hudi_stack.md
+++ b/website/docs/hudi_stack.md
@@ -141,7 +141,7 @@ is introduced, allowing multiple writers to concurrently
operate on the table wi

<p align = "center">Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
## Programming APIs
@@ -192,5 +192,5 @@ interactive engines such as Trino and Presto.
Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver,"
an efficient alternative for managing table metadata for a large number of
tables. Currently, the timeline server, embedded in Hudi's writer processes,
uses a local rocksDB store and [Javalin](https://javalin.io/) REST API to serve
file listings, reducing cloud storage listings.
-Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://issues.apache.org/jira/browse/HUDI-3345)
+Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://github.com/apache/hudi/issues/15011)
for future needs.
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index fe8827ebeec5..0f3cee915985 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -92,6 +92,7 @@ cleaned up, before re-enabling the metadata table again.
## Leveraging metadata during queries
### files index
+
Metadata based listing using *files_index* can be leveraged on the read side
by setting appropriate configs/session properties
from different engines as shown below:
@@ -100,10 +101,11 @@ from different engines as shown below:
| Spark DataSource, Spark SQL, Structured Streaming | hoodie.metadata.enable |
When set to `true` enables use of the spark file index implementation for Hudi,
that speeds up listing of large tables.<br /> |
| Flink DataStream, Flink SQL | metadata.enabled | When set to
`true` from DDL uses the internal metadata table to serve table metadata such
as file listings |
| Presto |
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://issues.apache.org/jira/browse/HUDI-7020). |
+| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://github.com/apache/hudi/issues/16286). |
| Athena |
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
### column_stats index and data skipping
+
Enabling metadata table and column stats index is a prerequisite to enabling
data skipping capabilities. Following are the
corresponding configs across Spark and Flink readers.
diff --git a/website/docs/metadata_indexing.md
b/website/docs/metadata_indexing.md
index 86df7c58061c..d22264e21fe7 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -310,9 +310,10 @@ Asynchronous indexing feature is still evolving. Few
points to note from deploym
think that particular index was disabled and cleanup the metadata partition.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
+
<h3>Videos</h3>
-* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi Hands on
Lab](https://www.youtube.com/watch?v=TSphQCsY4pY)
+- [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi Hands on
Lab](https://www.youtube.com/watch?v=TSphQCsY4pY)
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index aba1d5845c5b..2b45753c8770 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -552,14 +552,13 @@ Please check the below table for query types supported
and installation instruct
| > = 0.272 | No action needed. Hudi 0.10.1 version is a compile
time dependency. | File listing optimizations. Improved query performance. |
| > = 0.275 | No action needed. Hudi 0.11.0 version is a compile
time dependency. | All of the above. Native Hudi connector that is on par with
Hive connector. |
-
:::note
Incremental queries and point in time queries are not supported either through
the Hive connector or Hudi
connector. However, it is in our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
-To use the Hudi connector, please configure hudi catalog in `
/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
+To use the Hudi connector, please configure hudi catalog in
`/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
```properties
connector.name=hudi
diff --git a/website/releases/release-0.14.0.md
b/website/releases/release-0.14.0.md
index 58c786ad462d..c0276c3b5697 100644
--- a/website/releases/release-0.14.0.md
+++ b/website/releases/release-0.14.0.md
@@ -259,7 +259,7 @@ significantly reduce read latencies by 20 to 40% when
compared to the older file
bootstrap queries. The goal is to bring the latencies closer to those of the
COW (Copy On Write) file format. To utilize
this new file format, users need to set
`hoodie.datasource.read.use.new.parquet.file.format=true`. It's important to
note
that this feature is still experimental and comes with a few limitations. For
more details and if you're interested in
-contributing, please refer to
[HUDI-6568](https://issues.apache.org/jira/browse/HUDI-6568).
+contributing, please refer to [this GitHub
issue](https://github.com/apache/hudi/issues/16112).
### Spark write side improvements
@@ -332,7 +332,7 @@ compaction, clustering, and metadata table support has been
added to Java Engine
In Hudi 0.14.0, when querying a table that uses ComplexKeyGenerator or
CustomKeyGenerator, partition values are returned
as string. Note that there is no type change on the storage i.e. partition
fields are written in the user-defined type
on storage. However, this is a breaking change for the aforementioned key
generators and will be fixed in 0.14.1 -
-[HUDI-6914](https://issues.apache.org/jira/browse/HUDI-6914)
+[tracking issue](https://github.com/apache/hudi/issues/16251)
## Raw Release Notes
diff --git a/website/releases/release-1.0.0-beta1.md
b/website/releases/release-1.0.0-beta1.md
index 771fcc102025..f394f58c73b3 100644
--- a/website/releases/release-1.0.0-beta1.md
+++ b/website/releases/release-1.0.0-beta1.md
@@ -28,7 +28,7 @@ rolling upgrades from older versions to this release.
### Format changes
-[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic
covering all the format changes proposals,
+[This GitHub issue](https://github.com/apache/hudi/issues/15964) is the main
epic covering all the format changes proposals,
which are also partly covered in the [Hudi 1.0 tech
specification](/learn/tech-specs-1point0). The following are the main
changes in this release:
@@ -129,12 +129,13 @@ hoodie.merge.use.record.positions=true
```
Few things to note for the new reader:
+
- It is only applicable to COW or MOR tables with base files in Parquet format.
- Only snapshot queries for COW table, and snapshot queries and read-optimized
queries for MOR table are supported.
- Currently, the reader will not be able to push down the data filters to
scan. It is recommended to use key-based
merging for now.
-You can follow [HUDI-6243](https://issues.apache.org/jira/browse/HUDI-6243)
+You can follow [this GitHub issue](https://github.com/apache/hudi/issues/15965)
and [HUDI-6722](https://issues.apache.org/jira/browse/HUDI-6722) to keep track
of ongoing work related to reader/writer
API changes and performance improvements.
diff --git a/website/releases/release-1.0.0-beta2.md
b/website/releases/release-1.0.0-beta2.md
index 35748b0d7d6d..16f329623cb1 100644
--- a/website/releases/release-1.0.0-beta2.md
+++ b/website/releases/release-1.0.0-beta2.md
@@ -27,7 +27,7 @@ rolling upgrades from older versions to this release.
### Format changes
-[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic
covering all the format changes proposals,
+[This GitHub issue](https://github.com/apache/hudi/issues/15964) is the main
epic covering all the format changes proposals,
which are also partly covered in the [Hudi 1.0 tech
specification](/learn/tech-specs-1point0). The following are the main
changes in this release:
diff --git a/website/releases/release-1.0.0.md
b/website/releases/release-1.0.0.md
index 60f0b1e99632..a5a2e824fb5b 100644
--- a/website/releases/release-1.0.0.md
+++ b/website/releases/release-1.0.0.md
@@ -41,8 +41,7 @@ and
[RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#sup
### Format changes
-The main epic covering all the format changes is
[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
-covered in the [Hudi 1.0 tech specification](/learn/tech-specs-1point0). The
following are the main highlights with respect to format changes:
+The main epic covering all the format changes is [this GitHub
issue](https://github.com/apache/hudi/issues/15964), which is also covered in
the [Hudi 1.0 tech specification](/learn/tech-specs-1point0). The following are
the main highlights with respect to format changes:
#### Timeline
diff --git a/website/src/pages/faq/writing_tables.md
b/website/src/pages/faq/writing_tables.md
index e06143d6227d..961b8aa4eb64 100644
--- a/website/src/pages/faq/writing_tables.md
+++ b/website/src/pages/faq/writing_tables.md
@@ -125,8 +125,8 @@ The speed at which you can write into Hudi depends on the
[write operation](/doc
| ---| ---| ---| --- |
| copy on write | bulk_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](/docs/configurations#hoodiebulkinsertshuffleparallelism) to get
right number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](/docs/configurations#hoodiebulkinsertsortmode)
for better file sizes at the cost of memory. The default value `NONE` offers
the fastest performance and matches `spark.write [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](/docs/configurations#hoodieinsertshuffleparallelism) up, if
you see that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while ingest latency for updates (costliest
I/O operation in copy on write) are sent to log files and thus with
asynchronous compaction provides very good ingest performance with low write
amplification. | |
diff --git a/website/src/pages/roadmap.md b/website/src/pages/roadmap.md
index f678c3785137..09ea94411823 100644
--- a/website/src/pages/roadmap.md
+++ b/website/src/pages/roadmap.md
@@ -24,59 +24,57 @@ down by areas on our [stack](/docs/hudi_stack).
| Feature | Target Release |
Tracking
|
|------------------------------------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Introduce `.abort` state in the timeline | 1.2.0 |
[HUDI-8189](https://issues.apache.org/jira/browse/HUDI-8189) |
-| Variant type support on Spark 4 | 1.2.0 |
[HUDI-9046](https://issues.apache.org/jira/browse/HUDI-9046) |
-| Non-blocking updates during clustering | 1.2.0 |
[HUDI-1045](https://issues.apache.org/jira/browse/HUDI-1045)
|
-| Enable partial updates for CDC workload payload | 1.2.0 |
[HUDI-7229](https://issues.apache.org/jira/browse/HUDI-7229)
|
-| Schema tracking in metadata table | 1.2.0 |
[HUDI-6778](https://issues.apache.org/jira/browse/HUDI-6778) |
-| NBCC for MDT writes | 1.2.0 |
[HUDI-8480](https://issues.apache.org/jira/browse/HUDI-8480) |
-| Index abstraction for writer and reader | 1.2.0 |
[HUDI-9176](https://issues.apache.org/jira/browse/HUDI-9176) |
-| Vector search index | 1.2.0 |
[HUDI-9047](https://issues.apache.org/jira/browse/HUDI-9047) |
-| Bitmap index | 1.2.0 |
[HUDI-9048](https://issues.apache.org/jira/browse/HUDI-9048) |
+| Introduce `.abort` state in the timeline | 1.2.0 |
[#16609](https://github.com/apache/hudi/issues/16609) |
+| Variant type support on Spark 4 | 1.2.0 |
[#16851](https://github.com/apache/hudi/issues/16851) |
+| Non-blocking updates during clustering | 1.2.0 |
[#14611](https://github.com/apache/hudi/issues/14611)
|
+| Enable partial updates for CDC workload payload | 1.2.0 |
[#16354](https://github.com/apache/hudi/issues/16354)
|
+| Schema tracking in metadata table | 1.2.0 |
[#14397](https://github.com/apache/hudi/issues/14397) |
+| NBCC for MDT writes | 1.2.0 |
[#17305](https://github.com/apache/hudi/issues/17305) |
+| Index abstraction for writer and reader | 1.2.0 |
[#16903](https://github.com/apache/hudi/issues/16903) |
+| Vector search index | 1.2.0 |
[#16852](https://github.com/apache/hudi/issues/16852) |
+| Bitmap index | 1.2.0 |
[#16853](https://github.com/apache/hudi/issues/16853) |
| New abstraction for schema, expressions, and filters | 1.2.0 |
[RFC-88](https://github.com/apache/hudi/pull/12795) |
-| Streaming CDC/Incremental read improvement | 1.2.0 |
[HUDI-2749](https://issues.apache.org/jira/browse/HUDI-2749) |
-| Supervised table service planning and execution | 1.2.0 |
[RFC-43](https://github.com/apache/hudi/pull/4309),
[HUDI-4147](https://issues.apache.org/jira/browse/HUDI-4147)
|
-| General purpose support for multi-table transactions | 1.2.0 |
[HUDI-6709](https://issues.apache.org/jira/browse/HUDI-6709) |
-| Supporting different updated columns in a single partial update log file |
1.2.0 | [HUDI-9049](https://issues.apache.org/jira/browse/HUDI-9049) |
-| CDC format consolidation | 1.2.0 |
[HUDI-7538](https://issues.apache.org/jira/browse/HUDI-7538) |
-| Time Travel updates, deletes | 1.3.0 |
[HUDI-9050](https://issues.apache.org/jira/browse/HUDI-9050) |
-| Unstructured data storage and management | 1.3.0 |
[HUDI-9051](https://issues.apache.org/jira/browse/HUDI-9051)|
-
+| Streaming CDC/Incremental read improvement | 1.2.0 |
[#14916](https://github.com/apache/hudi/issues/14916) |
+| Supervised table service planning and execution | 1.2.0 |
[RFC-43](https://github.com/apache/hudi/pull/4309),
[#15196](https://github.com/apache/hudi/issues/15196)
|
+| General purpose support for multi-table transactions | 1.2.0 |
[#16181](https://github.com/apache/hudi/issues/16181) |
+| Supporting different updated columns in a single partial update log file |
1.2.0 | [#16854](https://github.com/apache/hudi/issues/16854) |
+| CDC format consolidation | 1.2.0 |
[#16429](https://github.com/apache/hudi/issues/16429) |
+| Time Travel updates, deletes | 1.3.0 |
[#16855](https://github.com/apache/hudi/issues/16855) |
+| Unstructured data storage and management | 1.3.0 |
[#16856](https://github.com/apache/hudi/issues/16856)|
## Programming APIs
| Feature | Target Release |
Tracking
|
|---------------------------------------------------------|----------------|----------------------------------------------------------------------------------------------------------------------------|
-| New Hudi Table Format APIs for Query Integrations | 1.2.0 |
[RFC-64](https://github.com/apache/hudi/pull/7080),
[HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141) |
-| Snapshot view management | 1.2.0 |
[RFC-61](https://github.com/apache/hudi/pull/6576),
[HUDI-4677](https://issues.apache.org/jira/browse/HUDI-4677) |
-| Support of verification with multiple event_time fields | 1.2.0 |
[RFC-59](https://github.com/apache/hudi/pull/6382),
[HUDI-4569](https://issues.apache.org/jira/browse/HUDI-4569) |
-
+| New Hudi Table Format APIs for Query Integrations | 1.2.0 |
[RFC-64](https://github.com/apache/hudi/pull/7080),
[#15194](https://github.com/apache/hudi/issues/15194) |
+| Snapshot view management | 1.2.0 |
[RFC-61](https://github.com/apache/hudi/pull/6576),
[#15367](https://github.com/apache/hudi/issues/15367) |
+| Support of verification with multiple event_time fields | 1.2.0 |
[RFC-59](https://github.com/apache/hudi/pull/6382),
[#15325](https://github.com/apache/hudi/issues/15325) |
## Query Engine Integration
| Feature | Target Release |
Tracking
|
|---------------------------------------------------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Default Java 17 support | 1.2.0
| [HUDI-6506](https://issues.apache.org/jira/browse/HUDI-6506)
|
-| Spark datasource V2 read | 1.2.0 |
[HUDI-4449](https://issues.apache.org/jira/browse/HUDI-4449)
|
-| Simplification of engine integration and module organization | 1.2.0
| [HUDI-9502](https://issues.apache.org/jira/browse/HUDI-9502) |
-| End-to-end DataFrame write path on Spark | 1.2.0 |
[HUDI-9019](https://issues.apache.org/jira/browse/HUDI-9019),
[HUDI-4857](https://issues.apache.org/jira/browse/HUDI-4857) |
-| Support Hudi 1.0 release in Presto Hudi Connector | Presto Release /
Q2 | [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210) |
-| Support of new indexes in Presto Hudi Connector | Presto Release /
Q3 | [HUDI-4394](https://issues.apache.org/jira/browse/HUDI-4394),
[HUDI-4552](https://issues.apache.org/jira/browse/HUDI-4552) |
-| MDT support in Trino Hudi Connector | Trino Release / Q2
| [HUDI-2687](https://issues.apache.org/jira/browse/HUDI-2687) |
-| Support of new indexes in Trino Hudi Connector | Trino Release / Q3
| [HUDI-4394](https://issues.apache.org/jira/browse/HUDI-4394),
[HUDI-4552](https://issues.apache.org/jira/browse/HUDI-4552) |
+| Default Java 17 support | 1.2.0
| [#16082](https://github.com/apache/hudi/issues/16082)
|
+| Spark datasource V2 read | 1.2.0 |
[#15292](https://github.com/apache/hudi/issues/15292)
|
+| Simplification of engine integration and module organization | 1.2.0
| [#17044](https://github.com/apache/hudi/issues/16857) |
+| End-to-end DataFrame write path on Spark | 1.2.0 |
[#16846](https://github.com/apache/hudi/issues/16846),
[#15433](https://github.com/apache/hudi/issues/15433) |
+| Support Hudi 1.0 release in Presto Hudi Connector | Presto Release /
Q2 | [#14992](https://github.com/apache/hudi/issues/14992) |
+| Support of new indexes in Presto Hudi Connector | Presto Release /
Q3 | [#15246](https://github.com/apache/hudi/issues/15246),
[#15319](https://github.com/apache/hudi/issues/15319) |
+| MDT support in Trino Hudi Connector | Trino Release / Q2
| [#14906](https://github.com/apache/hudi/issues/14906) |
+| Support of new indexes in Trino Hudi Connector | Trino Release / Q3
| [#15246](https://github.com/apache/hudi/issues/15246),
[#15319](https://github.com/apache/hudi/issues/15319) |
## Platform Components
| Feature
| Target Release | Tracking
|
|---------------------------------------------------------------------------------------------------|----------------|----------------------------------------------------------------------------------------------------------------------------------------|
-| Syncing as non-partitoned tables in catalogs | 1.2.0 |
[HUDI-9503](https://issues.apache.org/jira/browse/HUDI-9503) |
+| Syncing as non-partitioned tables in catalogs | 1.2.0 |
[#17045](https://github.com/apache/hudi/issues/16858) |
| Hudi Reverse streamer
| 1.2.0 |
[RFC-70](https://github.com/apache/hudi/pull/9040)
|
| Diagnostic Reporter
| 1.2.0 |
[RFC-62](https://github.com/apache/hudi/pull/6600)
|
-| Mutable, Transactional caching for Hudi Tables (could be accelerated based
on community feedback) | 2.0.0 | [Strawman
design](https://docs.google.com/presentation/d/1QBgLw11TM2Qf1KUESofGrQDb63EuggNCpPaxc82Kldo/edit#slide=id.gf7e0551254_0_5),
[HUDI-6489](https://issues.apache.org/jira/browse/HUDI-6489) |
-| Hudi Metaserver (could be accelerated based on community feedback)
| 2.0.0 |
[HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345),
[RFC-36](https://github.com/apache/hudi/pull/4718) |
-
+| Mutable, Transactional caching for Hudi Tables (could be accelerated based
on community feedback) | 2.0.0 | [Strawman
design](https://docs.google.com/presentation/d/1QBgLw11TM2Qf1KUESofGrQDb63EuggNCpPaxc82Kldo/edit#slide=id.gf7e0551254_0_5),
[#16072](https://github.com/apache/hudi/issues/16072) |
+| Hudi Metaserver (could be accelerated based on community feedback)
| 2.0.0 |
[#15011](https://github.com/apache/hudi/issues/15011),
[RFC-36](https://github.com/apache/hudi/pull/4718) |
## Developer Experience
+
| Feature | Target Release |
Tracking |
|---------------------------------------------------------|----------------|------------------------------------------|
-| Clean up tech debt and deprecate unused code | 1.2.0 |
[HUDI-9054](https://issues.apache.org/jira/browse/HUDI-9054) |
+| Clean up tech debt and deprecate unused code | 1.2.0 |
[#16859](https://github.com/apache/hudi/issues/16859) |
diff --git a/website/versioned_docs/version-0.14.0/comparison.md
b/website/versioned_docs/version-0.14.0/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-0.14.0/comparison.md
+++ b/website/versioned_docs/version-0.14.0/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-0.14.0/faq.md
b/website/versioned_docs/version-0.14.0/faq.md
index ef31aab3d1c5..3fc1742fada5 100644
--- a/website/versioned_docs/version-0.14.0/faq.md
+++ b/website/versioned_docs/version-0.14.0/faq.md
@@ -280,8 +280,8 @@ The speed at which you can write into Hudi depends on the
[write operation](http
| ---| ---| ---| --- |
| copy on write | bulk\_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](https://hudi.apache.org/docs/configurations#hoodiebulkinsertshuffleparallelism)
to get right number of files. Use insert if you want this auto tuned.
Configure
[hoodie.bulkinsert.sort.mode](https://hudi.apache.org/docs/configurations#hoodiebulkinsertsortmode)
for better file sizes at the cost of memory. The default value NONE offers th
[...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this
limit](https://hudi.apache.org/docs/configurations#hoodieinsertshuffleparallelism)
up, if you see that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while ingest latency for updates (costliest
I/O operation in copy\_on\_write) are sent to log files and thus with
asynchronous compaction provides very very good ingest performance with low
write amplification. | |
diff --git a/website/versioned_docs/version-0.14.0/metadata_indexing.md
b/website/versioned_docs/version-0.14.0/metadata_indexing.md
index a478e82bdabf..43669dbca59e 100644
--- a/website/versioned_docs/version-0.14.0/metadata_indexing.md
+++ b/website/versioned_docs/version-0.14.0/metadata_indexing.md
@@ -208,4 +208,4 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
For example, if async indexing is disabled and metadata is enabled along
with column stats index type, then both files and column stats index will be
created synchronously with ingestion.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
diff --git a/website/versioned_docs/version-0.14.0/sql_queries.md
b/website/versioned_docs/version-0.14.0/sql_queries.md
index 2da7c3acd46d..e38ff1eb9994 100644
--- a/website/versioned_docs/version-0.14.0/sql_queries.md
+++ b/website/versioned_docs/version-0.14.0/sql_queries.md
@@ -215,8 +215,7 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
It supports [querying Hudi
tables](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html) using
the Hive connector.
Currently, it supports snapshot queries on COPY_ON_WRITE tables, and snapshot
and read optimized queries on MERGE_ON_READ Hudi tables.
-:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes _ro query to return 0 records, and occasionally
_rt the query to fail with class cast exception.
-The issue is tracked in
[HUDI-7362](https://issues.apache.org/jira/browse/HUDI-7362) and is expected to
be fixed in the next release.
+:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes `_ro` queries to return 0 records, and occasionally
`_rt` queries to fail with a class cast exception. This is expected to be fixed
in 0.15.0.
:::
## Presto
@@ -241,7 +240,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point in time queries are not supported either through
the Hive connector or Hudi
connector. However, it is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure hudi catalog in `
/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
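For reference, a minimal sketch of the `hudi.properties` catalog file named
above, following the PrestoDB Hudi connector docs; the metastore URI is a
placeholder assumption for your deployment:

```properties
# Presto Hudi connector catalog (illustrative values)
connector.name=hudi
hive.metastore.uri=thrift://localhost:9083
```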
diff --git a/website/versioned_docs/version-0.14.1/comparison.md
b/website/versioned_docs/version-0.14.1/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-0.14.1/comparison.md
+++ b/website/versioned_docs/version-0.14.1/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-0.14.1/faq_writing_tables.md
b/website/versioned_docs/version-0.14.1/faq_writing_tables.md
index 1d9e64eab680..1b6dcfb6bdbe 100644
--- a/website/versioned_docs/version-0.14.1/faq_writing_tables.md
+++ b/website/versioned_docs/version-0.14.1/faq_writing_tables.md
@@ -124,8 +124,8 @@ The speed at which you can write into Hudi depends on the
[write operation](writ
| ---| ---| ---| --- |
| copy on write | bulk\_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](configurations#hoodiebulkinsertshuffleparallelism) to get right
number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](configurations#hoodiebulkinsertsortmode) for
better file sizes at the cost of memory. The default value NONE offers the
fastest performance and matches `spark.write.parquet()` [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](configurations#hoodieinsertshuffleparallelism) up, if you see
that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while ingest latency for updates (costliest
I/O operation in copy\_on\_write) are sent to log files and thus with
asynchronous compaction provides very very good ingest performance with low
write amplification. | |
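The copy-on-write upsert row above can be sketched as a set of write options.
The config keys are real Hudi configs; the table name, record key and
precombine field names (`uuid`, `ts`), and the parallelism value are
illustrative assumptions only:

```python
# Illustrative Hudi upsert write options for a copy-on-write table.
upsert_opts = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",   # assumed key field
    "hoodie.datasource.write.precombine.field": "ts",    # assumed ordering field
    "hoodie.upsert.shuffle.parallelism": "200",          # tune per cluster
}

# In a Spark job these would typically be applied as:
#   df.write.format("hudi").options(**upsert_opts).mode("append").save(base_path)
for key, value in sorted(upsert_opts.items()):
    print(f"{key}={value}")
```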
diff --git a/website/versioned_docs/version-0.14.1/metadata_indexing.md
b/website/versioned_docs/version-0.14.1/metadata_indexing.md
index c2a1984b7d14..54a446a0b7ed 100644
--- a/website/versioned_docs/version-0.14.1/metadata_indexing.md
+++ b/website/versioned_docs/version-0.14.1/metadata_indexing.md
@@ -208,7 +208,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
For example, if async indexing is disabled and metadata is enabled along
with column stats index type, then both files and column stats index will be
created synchronously with ingestion.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-0.14.1/sql_queries.md
b/website/versioned_docs/version-0.14.1/sql_queries.md
index 0c3edab8633c..8276af0e0e6d 100644
--- a/website/versioned_docs/version-0.14.1/sql_queries.md
+++ b/website/versioned_docs/version-0.14.1/sql_queries.md
@@ -223,8 +223,7 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
It supports [querying Hudi
tables](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html) using
the Hive connector.
Currently, it supports snapshot queries on COPY_ON_WRITE tables, and snapshot
and read optimized queries on MERGE_ON_READ Hudi tables.
-:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes _ro query to return 0 records, and occasionally
_rt the query to fail with class cast exception.
-The issue is tracked in
[HUDI-7362](https://issues.apache.org/jira/browse/HUDI-7362) and is expected to
be fixed in the next release.
+:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes `_ro` queries to return 0 records, and occasionally
`_rt` queries to fail with a class cast exception. This is expected to be fixed
in 0.15.0.
:::
## Presto
@@ -249,7 +248,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point in time queries are not supported either through
the Hive connector or Hudi
connector. However, it is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure hudi catalog in `
/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
diff --git a/website/versioned_docs/version-0.15.0/comparison.md
b/website/versioned_docs/version-0.15.0/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-0.15.0/comparison.md
+++ b/website/versioned_docs/version-0.15.0/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-0.15.0/faq_writing_tables.md
b/website/versioned_docs/version-0.15.0/faq_writing_tables.md
index e208ea5fa492..c08dfb92a5fa 100644
--- a/website/versioned_docs/version-0.15.0/faq_writing_tables.md
+++ b/website/versioned_docs/version-0.15.0/faq_writing_tables.md
@@ -124,8 +124,8 @@ The speed at which you can write into Hudi depends on the
[write operation](writ
| ---| ---| ---| --- |
| copy on write | bulk\_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](configurations#hoodiebulkinsertshuffleparallelism) to get right
number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](configurations#hoodiebulkinsertsortmode) for
better file sizes at the cost of memory. The default value NONE offers the
fastest performance and matches `spark.write.parquet()` [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](configurations#hoodieinsertshuffleparallelism) up, if you see
that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or \<50% updates. Compared to naively
overwriting entire partitions, Hudi write can be several magnitudes faster
depending on how many files in a given partition is actually updated. For e.g,
if a partition has 1000 f [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy\_on\_write bulk insert.
This has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while ingest latency for updates (costliest
I/O operation in copy\_on\_write) are sent to log files and thus with
asynchronous compaction provides very very good ingest performance with low
write amplification. | |
diff --git a/website/versioned_docs/version-0.15.0/hudi_stack.md
b/website/versioned_docs/version-0.15.0/hudi_stack.md
index 13c8a603f86d..af8bfb8396ec 100644
--- a/website/versioned_docs/version-0.15.0/hudi_stack.md
+++ b/website/versioned_docs/version-0.15.0/hudi_stack.md
@@ -71,13 +71,13 @@ Hudi's scalable metadata table contains auxiliary data
about the table. This sub

<p align = "center">Figure: Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
### Metaserver*

<p align = "center">Figure: Proposed Metaserver in Hudi</p>
-Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver," an efficient alternative for managing table metadata for a
large number of tables. Currently, the timeline server, embedded in Hudi's
writer processes, uses a local rocksDB store and [Javalin](https://javalin.io/)
REST API to serve file listings, reducing cloud storage listings. Since version
0.6.0, there's a tre [...]
+Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver," an efficient alternative for managing table metadata for a
large number of tables. Currently, the timeline server, embedded in Hudi's
writer processes, uses a local rocksDB store and [Javalin](https://javalin.io/)
REST API to serve file listings, reducing cloud storage listings. Since version
0.6.0, there's a tre [...]
## Programming APIs
diff --git a/website/versioned_docs/version-0.15.0/metadata.md
b/website/versioned_docs/version-0.15.0/metadata.md
index 426a02d17218..69b1929377de 100644
--- a/website/versioned_docs/version-0.15.0/metadata.md
+++ b/website/versioned_docs/version-0.15.0/metadata.md
@@ -134,7 +134,7 @@ from different engines as shown below:
|----------------------------------------------------------------------------------|------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| <ul><li>Spark DataSource</li><li>Spark SQL</li><li>Structured
Streaming</li></ul> | hoodie.metadata.enable | When set to `true` enables use
of the spark file index implementation for Hudi, that speeds up listing of
large tables.<br /> |
|Presto|
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-|Trino| N/A | Support for reading from the metadata table [has been dropped in
Trino 419](https://issues.apache.org/jira/browse/HUDI-7020). |
+|Trino| N/A | Support for reading from the metadata table [has been dropped in
Trino 419](https://github.com/apache/hudi/issues/16286). |
|Athena|
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
|<ul><li>Flink DataStream</li><li>Flink SQL</li></ul> | metadata.enabled |
When set to `true` from DDL, uses the internal metadata table to serve table
metadata like file listings |
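The per-engine flags in the table above can be summarized as follows; this is
a sketch for illustration, and the flags would be passed to each engine's
reader, DDL, or table properties as described in the table:

```python
# Metadata-table read flags per engine, taken from the table above.
metadata_flags = {
    "spark": {"hoodie.metadata.enable": "true"},
    "flink": {"metadata.enabled": "true"},
    "presto": {"hudi.metadata-table-enabled": "true"},
    "athena": {"hudi.metadata-listing-enabled": "TRUE"},
}

# e.g. Spark DataSource (assuming `spark` and `path` exist in your job):
#   spark.read.format("hudi").options(**metadata_flags["spark"]).load(path)
print(metadata_flags["spark"])
```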
diff --git a/website/versioned_docs/version-0.15.0/metadata_indexing.md
b/website/versioned_docs/version-0.15.0/metadata_indexing.md
index c2a1984b7d14..54a446a0b7ed 100644
--- a/website/versioned_docs/version-0.15.0/metadata_indexing.md
+++ b/website/versioned_docs/version-0.15.0/metadata_indexing.md
@@ -208,7 +208,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
For example, if async indexing is disabled and metadata is enabled along
with column stats index type, then both files and column stats index will be
created synchronously with ingestion.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-0.15.0/sql_queries.md
b/website/versioned_docs/version-0.15.0/sql_queries.md
index 69962ddf09b4..856433121654 100644
--- a/website/versioned_docs/version-0.15.0/sql_queries.md
+++ b/website/versioned_docs/version-0.15.0/sql_queries.md
@@ -223,10 +223,6 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
It supports [querying Hudi
tables](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html) using
the Hive connector.
Currently, it supports snapshot queries on COPY_ON_WRITE tables, and snapshot
and read optimized queries on MERGE_ON_READ Hudi tables.
-:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes _ro query to return 0 records, and occasionally
_rt the query to fail with class cast exception.
-The issue is tracked in
[HUDI-7362](https://issues.apache.org/jira/browse/HUDI-7362) and is expected to
be fixed in the next release.
-:::
-
## Presto
[Presto](https://prestodb.io/) is a popular query engine for interactive query
performance. Support for querying Hudi tables using PrestoDB is offered
@@ -249,7 +245,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point in time queries are not supported either through
the Hive connector or Hudi
connector. However, it is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure hudi catalog in `
/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
diff --git a/website/versioned_docs/version-1.0.0/comparison.md
b/website/versioned_docs/version-1.0.0/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-1.0.0/comparison.md
+++ b/website/versioned_docs/version-1.0.0/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For e.g: Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-1.0.0/faq_writing_tables.md
b/website/versioned_docs/version-1.0.0/faq_writing_tables.md
index 44bdde55a28e..fe061d1b491a 100644
--- a/website/versioned_docs/version-1.0.0/faq_writing_tables.md
+++ b/website/versioned_docs/version-1.0.0/faq_writing_tables.md
@@ -124,8 +124,8 @@ The speed at which you can write into Hudi depends on the
[write operation](writ
| ---| ---| ---| --- |
| copy on write | bulk_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](configurations#hoodiebulkinsertshuffleparallelism) to get right
number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](configurations#hoodiebulkinsertsortmode) for
better file sizes at the cost of memory. The default value `NONE` offers the
fastest performance and matches `spark.write.parquet()` [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](configurations#hoodieinsertshuffleparallelism) up, if you see
that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while ingest latency for updates (costliest
I/O operation in copy on write) are sent to log files and thus with
asynchronous compaction provides very good ingest performance with low write
amplification. | |
diff --git a/website/versioned_docs/version-1.0.0/hudi_stack.md
b/website/versioned_docs/version-1.0.0/hudi_stack.md
index 47a5368431c8..214a1ba2f7aa 100644
--- a/website/versioned_docs/version-1.0.0/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.0/hudi_stack.md
@@ -122,7 +122,7 @@ is introduced, allowing multiple writers to concurrently
operate on the table wi

<p align = "center">Figure: Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
## Programming APIs
@@ -168,7 +168,7 @@ interactive engines such as Trino and Presto.
Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver,"
an efficient alternative for managing table metadata for a large number of
tables. Currently, the timeline server, embedded in Hudi's writer processes,
uses a local rocksDB store and [Javalin](https://javalin.io/) REST API to serve
file listings, reducing cloud storage listings.
-Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://issues.apache.org/jira/browse/HUDI-3345)
+Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://github.com/apache/hudi/issues/15011)
for future needs.
diff --git a/website/versioned_docs/version-1.0.0/metadata.md
b/website/versioned_docs/version-1.0.0/metadata.md
index 6ad199e7dec6..7606dc08bce0 100644
--- a/website/versioned_docs/version-1.0.0/metadata.md
+++ b/website/versioned_docs/version-1.0.0/metadata.md
@@ -100,7 +100,7 @@ from different engines as shown below:
| Spark DataSource, Spark SQL, Structured Streaming | hoodie.metadata.enable |
When set to `true`, enables use of the Spark file index implementation for Hudi,
which speeds up listing of large tables.<br /> |
| Flink DataStream, Flink SQL | metadata.enabled | When set to
`true` from DDL, uses the internal metadata table to serve table metadata like
partition-level file listings |
| Presto |
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://issues.apache.org/jira/browse/HUDI-7020). |
+| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://github.com/apache/hudi/issues/16286). |
| Athena |
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
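The per-engine flags in the table above can be gathered in one place. As a hedged sketch (the helper below is not a Hudi API, and the engine labels are arbitrary names chosen for this example; only the config keys come from the table):

```python
# A hedged sketch, not a Hudi API: maps engines to the config key, from the
# table above, that enables metadata-table-backed file listings.
def metadata_options(engine: str) -> dict:
    """Return the option that turns on the Hudi metadata table for `engine`."""
    if engine in ("spark-datasource", "spark-sql", "structured-streaming"):
        return {"hoodie.metadata.enable": "true"}
    if engine in ("flink-datastream", "flink-sql"):
        return {"metadata.enabled": "true"}  # set via DDL in Flink SQL
    if engine == "presto":
        return {"hudi.metadata-table-enabled": "true"}
    if engine == "athena":
        return {"hudi.metadata-listing-enabled": "TRUE"}  # table property
    raise ValueError(f"unsupported engine: {engine}")

print(metadata_options("spark-sql"))  # → {'hoodie.metadata.enable': 'true'}
```

Note that Trino is deliberately absent: per the table, it no longer reads from the metadata table.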
### column_stats index and data skipping
diff --git a/website/versioned_docs/version-1.0.0/metadata_indexing.md
b/website/versioned_docs/version-1.0.0/metadata_indexing.md
index 7fc0a552e050..475b68beb9a4 100644
--- a/website/versioned_docs/version-1.0.0/metadata_indexing.md
+++ b/website/versioned_docs/version-1.0.0/metadata_indexing.md
@@ -310,7 +310,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
think that particular index was disabled and cleanup the metadata partition.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-1.0.0/sql_queries.md
b/website/versioned_docs/version-1.0.0/sql_queries.md
index f3ff058862d0..1ab1f1dc8289 100644
--- a/website/versioned_docs/version-1.0.0/sql_queries.md
+++ b/website/versioned_docs/version-1.0.0/sql_queries.md
@@ -534,10 +534,6 @@ separated) and calls InputFormat.listStatus() only once
with all those partition
It supports [querying Hudi
tables](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html) using
the Hive connector.
Currently, it supports snapshot queries on COPY_ON_WRITE tables, and snapshot
and read optimized queries on MERGE_ON_READ Hudi tables.
-:::note The most recent release of Athena that supports querying Hudi 0.14.0
tables has a bug that causes _ro query to return 0 records, and occasionally
_rt the query to fail with class cast exception.
-The issue is tracked in
[HUDI-7362](https://issues.apache.org/jira/browse/HUDI-7362) and is expected to
be fixed in the next release.
-:::
-
## Presto
[Presto](https://prestodb.io/) is a popular query engine for interactive query
performance. Support for querying Hudi tables using PrestoDB is offered
@@ -560,7 +556,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point-in-time queries are not supported through either
the Hive connector or the Hudi
connector. However, this is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure the Hudi catalog in
`/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
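The hunk above is cut off before the properties themselves. As an assumption based on the PrestoDB Hudi connector documentation (not taken from this diff), a minimal catalog file typically looks like the following, with the metastore URI as a placeholder:

```properties
connector.name=hudi
hive.metastore.uri=thrift://<metastore-host>:9083
```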
diff --git a/website/versioned_docs/version-1.0.1/comparison.md
b/website/versioned_docs/version-1.0.1/comparison.md
index 7ba799e1453e..30ededd13a83 100644
--- a/website/versioned_docs/version-1.0.1/comparison.md
+++ b/website/versioned_docs/version-1.0.1/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For example, Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
\ No newline at end of file
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-1.0.1/faq_writing_tables.md
b/website/versioned_docs/version-1.0.1/faq_writing_tables.md
index 44bdde55a28e..fe061d1b491a 100644
--- a/website/versioned_docs/version-1.0.1/faq_writing_tables.md
+++ b/website/versioned_docs/version-1.0.1/faq_writing_tables.md
@@ -124,8 +124,8 @@ The speed at which you can write into Hudi depends on the
[write operation](writ
| ---| ---| ---| --- |
| copy on write | bulk_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](configurations#hoodiebulkinsertshuffleparallelism) to get the right
number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](configurations#hoodiebulkinsertsortmode) for
better file sizes at the cost of memory. The default value `NONE` offers the
fastest performance and matches `spark.write.parquet()` [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](configurations#hoodieinsertshuffleparallelism) up, if you see
that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these involve an
index lookup. Compared to naively using a JOIN in Spark (or a similar framework) to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, a Hudi write can be orders of magnitude
faster depending on how many files in a given partition are actually updated.
For example, if a partitio [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while updates (the costliest
I/O operation in copy on write) are sent to log files; thus, with
asynchronous compaction, this provides very good ingest performance with low write
amplification. | |
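The bulk_insert knobs named in the table above can be sketched as a Spark writer option map. This is an illustrative sketch under stated assumptions, not a definitive recipe: the parallelism value is a placeholder to tune per workload, and the commented-out write call assumes a PySpark DataFrame `df` and a target `path`:

```python
# Illustrative option map for a Spark bulk_insert write; the config keys
# are the ones named in the table above, the values are placeholders.
bulk_insert_opts = {
    "hoodie.datasource.write.operation": "bulk_insert",
    # size parallelism to hit the desired number of output files
    "hoodie.bulkinsert.shuffle.parallelism": "200",
    # NONE is fastest and matches plain spark.write.parquet(); the sorting
    # modes trade write speed for better file sizing
    "hoodie.bulkinsert.sort.mode": "NONE",
}

# df.write.format("hudi").options(**bulk_insert_opts).mode("append").save(path)
print(bulk_insert_opts["hoodie.bulkinsert.sort.mode"])  # → NONE
```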
diff --git a/website/versioned_docs/version-1.0.1/hudi_stack.md
b/website/versioned_docs/version-1.0.1/hudi_stack.md
index 3ea30c6028ed..7aba8d5842fd 100644
--- a/website/versioned_docs/version-1.0.1/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.1/hudi_stack.md
@@ -122,7 +122,7 @@ is introduced, allowing multiple writers to concurrently
operate on the table wi

<p align = "center">Figure: Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
## Programming APIs
@@ -168,7 +168,7 @@ interactive engines such as Trino and Presto.
Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver,"
an efficient alternative for managing table metadata for a large number of
tables. Currently, the timeline server, embedded in Hudi's writer processes,
uses a local rocksDB store and [Javalin](https://javalin.io/) REST API to serve
file listings, reducing cloud storage listings.
-Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://issues.apache.org/jira/browse/HUDI-3345)
+Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://github.com/apache/hudi/issues/15011)
for future needs.
diff --git a/website/versioned_docs/version-1.0.1/metadata.md
b/website/versioned_docs/version-1.0.1/metadata.md
index fe8827ebeec5..a9df1ffc82d9 100644
--- a/website/versioned_docs/version-1.0.1/metadata.md
+++ b/website/versioned_docs/version-1.0.1/metadata.md
@@ -100,7 +100,7 @@ from different engines as shown below:
| Spark DataSource, Spark SQL, Structured Streaming | hoodie.metadata.enable |
When set to `true`, enables use of the Spark file index implementation for Hudi,
which speeds up listing of large tables.<br /> |
| Flink DataStream, Flink SQL | metadata.enabled | When set to
`true` from DDL, uses the internal metadata table to serve table metadata like
partition-level file listings |
| Presto |
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://issues.apache.org/jira/browse/HUDI-7020). |
+| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://github.com/apache/hudi/issues/16286). |
| Athena |
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
### column_stats index and data skipping
diff --git a/website/versioned_docs/version-1.0.1/metadata_indexing.md
b/website/versioned_docs/version-1.0.1/metadata_indexing.md
index ff4b05772df4..79b08b3583a4 100644
--- a/website/versioned_docs/version-1.0.1/metadata_indexing.md
+++ b/website/versioned_docs/version-1.0.1/metadata_indexing.md
@@ -310,7 +310,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
think that particular index was disabled and cleanup the metadata partition.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-1.0.1/sql_queries.md
b/website/versioned_docs/version-1.0.1/sql_queries.md
index e01af000c436..1ab1f1dc8289 100644
--- a/website/versioned_docs/version-1.0.1/sql_queries.md
+++ b/website/versioned_docs/version-1.0.1/sql_queries.md
@@ -556,7 +556,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point-in-time queries are not supported through either
the Hive connector or the Hudi
connector. However, this is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure the Hudi catalog in
`/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
diff --git a/website/versioned_docs/version-1.0.2/comparison.md
b/website/versioned_docs/version-1.0.2/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-1.0.2/comparison.md
+++ b/website/versioned_docs/version-1.0.2/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For example, Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-1.0.2/faq_writing_tables.md
b/website/versioned_docs/version-1.0.2/faq_writing_tables.md
index 4555d60dbbf8..5d33a88d9dff 100644
--- a/website/versioned_docs/version-1.0.2/faq_writing_tables.md
+++ b/website/versioned_docs/version-1.0.2/faq_writing_tables.md
@@ -124,8 +124,8 @@ The speed at which you can write into Hudi depends on the
[write operation](writ
| ---| ---| ---| --- |
| copy on write | bulk_insert | Should match vanilla spark writing + an
additional sort to properly size files | properly size [bulk insert
parallelism](configurations#hoodiebulkinsertshuffleparallelism) to get the right
number of files. Use insert if you want this auto tuned. Configure
[hoodie.bulkinsert.sort.mode](configurations#hoodiebulkinsertsortmode) for
better file sizes at the cost of memory. The default value `NONE` offers the
fastest performance and matches `spark.write.parquet()` [...]
| copy on write | insert | Similar to bulk insert, except the file sizes are
auto tuned requiring input to be cached into memory and custom partitioned. |
Performance would be bound by how parallel you can write the ingested data.
Tune [this limit](configurations#hoodieinsertshuffleparallelism) up, if you see
that writes are happening from only a few executors. |
-| copy on write | upsert/ de-duplicate & insert | Both of these would involve
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, Hudi write can be several magnitudes
faster depending on how many files in a given partition is actually updated.
For example, if a partitio [...]
-| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add
support for logging inserts directly and this up drastically. | |
+| copy on write | upsert/ de-duplicate & insert | Both of these involve an
index lookup. Compared to naively using a JOIN in Spark (or a similar framework) to
identify the affected records, Hudi indexing is often 7-10x faster as long as
you have ordered keys (discussed below) or less than 50% updates. Compared to
naively overwriting entire partitions, a Hudi write can be orders of magnitude
faster depending on how many files in a given partition are actually updated.
For example, if a partitio [...]
+| merge on read | bulk insert | Currently new data only goes to parquet files
and thus performance here should be similar to copy on write bulk insert. This
has the nice side-effect of getting data into parquet directly for query
performance. [This GitHub issue](https://github.com/apache/hudi/issues/14468)
will add support for logging inserts directly and speed this up drastically. | |
| merge on read | insert | Similar to above | |
| merge on read | upsert/ de-duplicate & insert | Indexing performance would
remain the same as copy-on-write, while updates (the costliest
I/O operation in copy on write) are sent to log files; thus, with
asynchronous compaction, this provides very good ingest performance with low write
amplification. | |
diff --git a/website/versioned_docs/version-1.0.2/hudi_stack.md
b/website/versioned_docs/version-1.0.2/hudi_stack.md
index 47a5368431c8..214a1ba2f7aa 100644
--- a/website/versioned_docs/version-1.0.2/hudi_stack.md
+++ b/website/versioned_docs/version-1.0.2/hudi_stack.md
@@ -122,7 +122,7 @@ is introduced, allowing multiple writers to concurrently
operate on the table wi

<p align = "center">Figure: Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
## Programming APIs
@@ -168,7 +168,7 @@ interactive engines such as Trino and Presto.
Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver,"
an efficient alternative for managing table metadata for a large number of
tables. Currently, the timeline server, embedded in Hudi's writer processes,
uses a local rocksDB store and [Javalin](https://javalin.io/) REST API to serve
file listings, reducing cloud storage listings.
-Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://issues.apache.org/jira/browse/HUDI-3345)
+Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://github.com/apache/hudi/issues/15011)
for future needs.
diff --git a/website/versioned_docs/version-1.0.2/metadata.md
b/website/versioned_docs/version-1.0.2/metadata.md
index fe8827ebeec5..a9df1ffc82d9 100644
--- a/website/versioned_docs/version-1.0.2/metadata.md
+++ b/website/versioned_docs/version-1.0.2/metadata.md
@@ -100,7 +100,7 @@ from different engines as shown below:
| Spark DataSource, Spark SQL, Structured Streaming | hoodie.metadata.enable |
When set to `true`, enables use of the Spark file index implementation for Hudi,
which speeds up listing of large tables.<br /> |
| Flink DataStream, Flink SQL | metadata.enabled | When set to
`true` from DDL, uses the internal metadata table to serve table metadata like
partition-level file listings |
| Presto |
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://issues.apache.org/jira/browse/HUDI-7020). |
+| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://github.com/apache/hudi/issues/16286). |
| Athena |
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
### column_stats index and data skipping
diff --git a/website/versioned_docs/version-1.0.2/metadata_indexing.md
b/website/versioned_docs/version-1.0.2/metadata_indexing.md
index ffacbdf20fe9..29959625cb35 100644
--- a/website/versioned_docs/version-1.0.2/metadata_indexing.md
+++ b/website/versioned_docs/version-1.0.2/metadata_indexing.md
@@ -310,7 +310,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
think that particular index was disabled and cleanup the metadata partition.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-1.0.2/sql_queries.md
b/website/versioned_docs/version-1.0.2/sql_queries.md
index 28f86c6a14c8..7d70f7353805 100644
--- a/website/versioned_docs/version-1.0.2/sql_queries.md
+++ b/website/versioned_docs/version-1.0.2/sql_queries.md
@@ -556,7 +556,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point-in-time queries are not supported through either
the Hive connector or the Hudi
connector. However, this is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure the Hudi catalog in
`/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows:
diff --git a/website/versioned_docs/version-1.1.0/comparison.md
b/website/versioned_docs/version-1.1.0/comparison.md
index 0bcce2ace532..30ededd13a83 100644
--- a/website/versioned_docs/version-1.1.0/comparison.md
+++ b/website/versioned_docs/version-1.1.0/comparison.md
@@ -53,4 +53,4 @@ of PrestoDB/SparkSQL/Hive for your queries.
More advanced use cases revolve around the concepts of [incremental
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
which effectively
uses Hudi even inside the `processing` engine to speed up typical batch
pipelines. For example, Hudi can be used as a state store inside a processing DAG
(similar
to how
[rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend)
is used by Flink). This is an item on the roadmap
-and will eventually happen as a [Beam
Runner](https://issues.apache.org/jira/browse/HUDI-60)
+and will eventually happen as a [Beam
Runner](https://github.com/apache/hudi/issues/14452)
diff --git a/website/versioned_docs/version-1.1.0/hudi_stack.md
b/website/versioned_docs/version-1.1.0/hudi_stack.md
index 67e2bdbb1b8f..fbc3a05e7930 100644
--- a/website/versioned_docs/version-1.1.0/hudi_stack.md
+++ b/website/versioned_docs/version-1.1.0/hudi_stack.md
@@ -141,7 +141,7 @@ is introduced, allowing multiple writers to concurrently
operate on the table wi

<p align = "center">Proposed Lake Cache in Hudi</p>
-Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
+Data lakes today face a tradeoff between fast data writing and optimal query
performance. Writing smaller files or logging deltas enhances writing speed,
but superior query performance typically requires opening fewer files and
pre-materializing merges. Most databases use a buffer pool to reduce storage
access costs. Hudi’s design supports creating a multi-tenant caching tier that
can store pre-merged File Slices. Hudi’s timeline can then be used to simply
communicate caching policies. T [...]
## Programming APIs
@@ -192,5 +192,5 @@ interactive engines such as Trino and Presto.
Storing table metadata on lake storage, while scalable, is less efficient than
RPCs to a scalable meta server. Hudi addresses this with its metadata server,
called "metaserver,"
an efficient alternative for managing table metadata for a large number of
tables. Currently, the timeline server, embedded in Hudi's writer processes,
uses a local rocksDB store and [Javalin](https://javalin.io/) REST API to serve
file listings, reducing cloud storage listings.
-Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://issues.apache.org/jira/browse/HUDI-3345)
+Since version 0.6.0, there's a trend towards standalone timeline servers,
aimed at horizontal scaling and improved security. These developments are set
to create a more efficient lake
[metastore](https://github.com/apache/hudi/issues/15011)
for future needs.
diff --git a/website/versioned_docs/version-1.1.0/metadata.md
b/website/versioned_docs/version-1.1.0/metadata.md
index fe8827ebeec5..a9df1ffc82d9 100644
--- a/website/versioned_docs/version-1.1.0/metadata.md
+++ b/website/versioned_docs/version-1.1.0/metadata.md
@@ -100,7 +100,7 @@ from different engines as shown below:
| Spark DataSource, Spark SQL, Structured Streaming | hoodie.metadata.enable |
When set to `true`, enables use of the Spark file index implementation for Hudi,
which speeds up listing of large tables.<br /> |
| Flink DataStream, Flink SQL | metadata.enabled | When set to
`true` from DDL, uses the internal metadata table to serve table metadata like
partition-level file listings |
| Presto |
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
| When set to `true` fetches the list of file names and sizes from
Hudi’s metadata table rather than storage. |
-| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://issues.apache.org/jira/browse/HUDI-7020). |
+| Trino | N/A | Support for reading
from the metadata table [has been dropped in Trino
419](https://github.com/apache/hudi/issues/16286). |
| Athena |
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
| When this table property is set to `TRUE` enables the Hudi metadata table
and the related file listing functionality |
### column_stats index and data skipping
diff --git a/website/versioned_docs/version-1.1.0/metadata_indexing.md
b/website/versioned_docs/version-1.1.0/metadata_indexing.md
index 86df7c58061c..4d94ce5f47c4 100644
--- a/website/versioned_docs/version-1.1.0/metadata_indexing.md
+++ b/website/versioned_docs/version-1.1.0/metadata_indexing.md
@@ -310,7 +310,7 @@ Asynchronous indexing feature is still evolving. Few points
to note from deploym
think that particular index was disabled and cleanup the metadata partition.
Some of these limitations will be removed in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for
developments on this feature.
+follow [this GitHub issue](https://github.com/apache/hudi/issues/14870) for
developments on this feature.
## Related Resources
<h3>Videos</h3>
diff --git a/website/versioned_docs/version-1.1.0/sql_queries.md
b/website/versioned_docs/version-1.1.0/sql_queries.md
index aba1d5845c5b..cdd348b08a07 100644
--- a/website/versioned_docs/version-1.1.0/sql_queries.md
+++ b/website/versioned_docs/version-1.1.0/sql_queries.md
@@ -556,7 +556,7 @@ Please check the below table for query types supported and
installation instruct
:::note
Incremental queries and point-in-time queries are not supported through either
the Hive connector or the Hudi
connector. However, this is on our roadmap, and you can track the development
-under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).
+under [this GitHub issue](https://github.com/apache/hudi/issues/14992).
:::
To use the Hudi connector, please configure the Hudi catalog in
`/presto-server-0.2xxx/etc/catalog/hudi.properties` as follows: