This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7529570682e [DOCS] Add release highlights for 1.0.0 release (#12475)
7529570682e is described below
commit 7529570682e111af847fd7d6f968abff797df355
Author: Sagar Sumit <[email protected]>
AuthorDate: Fri Dec 13 09:07:02 2024 +0530
[DOCS] Add release highlights for 1.0.0 release (#12475)
* Add release highlights for 1.0.0 release
* Code Review comments for release-1.0.0.md
* Fix links and address review comments wrt upgrading
* Add limitations
---------
Co-authored-by: vinoth chandar <[email protected]>
---
website/docs/concurrency_control.md | 5 +
website/docs/deployment.md | 26 +++++
website/docs/sql_ddl.md | 38 ++++---
website/docs/sql_dml.md | 8 ++
website/docs/sql_queries.md | 2 +-
website/releases/release-1.0.0-beta2.md | 2 +-
website/releases/release-1.0.0.md | 172 ++++++++++++++++++++++++++++++++
7 files changed, 230 insertions(+), 23 deletions(-)
diff --git a/website/docs/concurrency_control.md
b/website/docs/concurrency_control.md
index e14bd1c8206..549f1ddd17e 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -214,6 +214,11 @@ currently available for preview in version 1.0.0-beta only
with the caveat that
between clustering and ingestion. It works for compaction and ingestion, and
we can see an example of that with Flink
writers [here](sql_dml#non-blocking-concurrency-control-experimental).
+:::note
+`NON_BLOCKING_CONCURRENCY_CONTROL` between an ingestion writer and a table service
writer is not yet supported for clustering.
+Please use `OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
+:::
+
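To stay on OCC for clustering while the above limitation holds, writers typically set the concurrency mode and a lock provider explicitly. Below is a minimal sketch using Spark SQL table properties, assuming a table named `hudi_table` and a single-JVM deployment where the in-process lock provider suffices (verify the config names against your Hudi version):

```sql
-- Sketch: keep clustering on OCC with an explicit lock provider
ALTER TABLE hudi_table SET TBLPROPERTIES (
  'hoodie.write.concurrency.mode' = 'OPTIMISTIC_CONCURRENCY_CONTROL',
  'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.InProcessLockProvider'
);
```

For multi-process deployments, a distributed lock provider (e.g. ZooKeeper or Hive Metastore based) would be required instead.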
## Early Conflict Detection
Multi-writer support using OCC allows multiple writers to concurrently write and
atomically commit to the Hudi table if there is no overlapping data file to be
written, guaranteeing data consistency, integrity and correctness. Prior to the
0.13.0 release, as the OCC (optimistic concurrency control) name suggests, each
writer optimistically proceeds with ingestion and, towards the end, just
before committing, goes through a conflict resolution flow to deduce overlapping
writes and abort one if need [...]
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 3e572867e79..1c5a41a0acb 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -165,6 +165,32 @@ As general guidelines,
Note that release notes can override this information with specific
instructions, applicable on a case-by-case basis.
+### Upgrading to 1.0.0
+
+1.0.0 is a major release with significant format changes. To ensure a smooth
migration experience, we recommend the
+following steps:
+
+1. Stop any async table services in 0.x completely.
+2. Upgrade writers to 1.x with table version (tv) 6, `autoUpgrade` and
metadata disabled (this won't auto-upgrade anything);
+ 0.x readers will continue to work; writers can also be readers and will
continue to read tv=6 tables.
+ a. Set `hoodie.write.auto.upgrade` to false.
+ b. Set `hoodie.metadata.enable` to false.
+3. Upgrade table services to 1.x with tv=6, and resume operations.
+4. Upgrade all remaining readers to 1.x, with tv=6.
+5. Redeploy writers with tv=8; table services and readers will adapt/pick up
tv=8 on the fly.
+6. Once all readers and writers are in 1.x, we are good to enable any new
features, including metadata, with 1.x tables.
+
+During the upgrade, the metadata table will not be updated and will lag behind
the data table. It is important to note
+that the metadata table is updated only once the writer is upgraded to tv=8.
So, even readers should keep metadata
+disabled during the rolling upgrade until all writers are upgraded to tv=8.
+
+:::caution
+Most steps are handled seamlessly by the auto-upgrade process, but there are
some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate.
Please
+check
[RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
+for more details.
+:::
+
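The writer-side settings in step 2 above can be sketched as Spark SQL session configs. This is only a sketch; depending on your writer, these may instead be passed as DataSource write options or writer properties:

```sql
-- Sketch of step 2: run 1.x binaries while holding the table at tv=6
SET hoodie.write.auto.upgrade = false;  -- do not auto-upgrade the table version
SET hoodie.metadata.enable = false;     -- keep the metadata table disabled during the rolling upgrade
```

Both configs are reverted (auto-upgrade re-enabled, metadata re-enabled) only in steps 5 and 6, once all readers and writers run 1.x.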
## Downgrading
Upgrade is automatic whenever a new Hudi version is used whereas downgrade is
a manual step. We need to use the Hudi
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index c00b815ac4e..ed4b151c754 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -272,6 +272,7 @@ Both index and column on which the index is created can be
qualified with some o
Please note in order to create secondary index:
1. The table must have a primary key and merge mode should be
[COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
2. Record index must be enabled. This can be done by setting
`hoodie.metadata.record.index.enable=true` and then creating `record_index`.
Please note the example below.
+3. Secondary index is not supported for [complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
:::
**Examples**
@@ -334,12 +335,18 @@ date based partitioning, provide same benefits to
queries, even if the physical
CREATE INDEX IF NOT EXISTS ts_datestr ON hudi_table
USING column_stats(ts)
OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');
--- Create a expression index on the column `ts` (timestamp in yyyy-MM-dd
HH:mm:ss) of the table `hudi_table` using the function `hour`
+-- Create an expression index on the column `ts` (timestamp in yyyy-MM-dd
HH:mm:ss) of the table `hudi_table` using the function `hour`
CREATE INDEX ts_hour ON hudi_table
USING column_stats(ts)
options(expr='hour');
```
+:::note
+1. An expression index can only be created via SQL on the Spark engine. It is not
yet supported with the Spark DataSource API.
+2. Expression index is not yet supported for [complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+3. Expression index is supported for unary and certain binary expressions.
Please check the [SQL DDL docs](sql_ddl#create-expression-index) for more details.
+:::
+
The `expr` option is required for creating expression index, and it should be
a valid Spark SQL function. Please check the syntax
for the above functions in the [Spark SQL
documentation](https://spark.apache.org/docs/latest/sql-ref-functions.html) and
provide the options accordingly. For example,
the `format` option is required for `from_unixtime` function.
@@ -434,6 +441,12 @@ and execution.
To enable partition stats index, simply set
`hoodie.metadata.index.partition.stats.enable = 'true'` in create table options.
+:::note
+1. The `column_stats` index must be enabled for the `partition_stats` index; the
two go hand in hand.
+2. The `partition_stats` index is not created automatically for all columns.
Users must specify the list of columns for which they want to create the
partition stats index.
+3. The `column_stats` and `partition_stats` indexes are not yet supported for
[complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+:::
+
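The note above can be illustrated with a sketch of create-table options that enable both indexes together. The table and column names are illustrative, and `hoodie.metadata.index.column.stats.column.list` is assumed to be the column-list config; verify the exact keys against the configuration reference for your version:

```sql
-- Sketch: enable column stats and partition stats for selected columns
CREATE TABLE hudi_table_with_stats (
  id INT, name STRING, price DOUBLE, ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id',
  'hoodie.metadata.index.column.stats.enable' = 'true',
  'hoodie.metadata.index.partition.stats.enable' = 'true',
  'hoodie.metadata.index.column.stats.column.list' = 'price,ts'
);
```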
### Create Secondary Index
Secondary indexes are record level indexes built on any column in the table.
It supports multiple records having the same
@@ -441,11 +454,8 @@ secondary column value efficiently and is built on top of
the existing record le
Secondary indexes are hash based indexes that offer horizontally scalable
write performance by splitting key space into shards
by hashing, as well as fast lookups by employing row-based file formats.
-:::note
-Please note in order to create secondary index:
-1. The table must have a primary key and merge mode should be
[COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
-2. Record index must be enabled. This can be done by setting
`hoodie.metadata.record.index.enable=true` and then creating `record_index`.
Please note the example below.
-:::
+Let us now look at an example of creating a table with multiple indexes and
how queries leverage the indexes for both
+partition pruning and data skipping.
```sql
DROP TABLE IF EXISTS hudi_table;
@@ -513,24 +523,10 @@ Bloom filter indexes store a bloom filter per file, on
the column or column expr
effective in skipping files that don't contain a high cardinality column value
e.g. uuids.
```sql
-CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING
bloom_filters(driver) OPTIONS(expr='identity');
+-- Create a bloom filter index on the column derived from expression
`lower(rider)` of the table `hudi_indexed_table`
CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider)
OPTIONS(expr='lower');
```
-
-### Limitations
-
-- Unlike column stats, partition stats index is not created automatically for
all columns. Users must specify list of
- columns for which they want to create partition stats index.
-- Predicate on internal meta fields such as `_hoodie_record_key` or
`_hoodie_partition_path` cannot be used for data
- skipping. Queries with such predicates cannot leverage the indexes.
-- Secondary index is not supported for nested fields.
-- Secondary index can be created only if record index is available in the table
-- Secondary index can only be used for tables using
OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode
-- Column stats Expression Index can not be created using `identity` expression
with SQL. Users can leverage column stat index using Datasource instead.
-- Index update can fail with schema evolution.
-- Only one index can be created at a time using [async
indexer](metadata_indexing).
-
### Setting Hudi configs
There are different ways you can pass the configs for a given hudi table.
diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index 6f5fe28a3eb..8b4200154ec 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -212,6 +212,14 @@ SELECT id, name, price, _ts, description FROM tableName;
Notice, instead of `UPDATE SET *`, we are updating only the `price` and `_ts`
columns.
+:::note
+Partial update is not yet supported in the following cases:
+1. When the target table is a bootstrapped table.
+2. When virtual keys are enabled.
+3. When schema on read is enabled.
+4. When there is an enum field in the source data.
+:::
+
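For reference, the partial-update form discussed above looks like the following sketch, reusing the `tableName`, `price` and `_ts` columns from the example; the source subquery is illustrative:

```sql
-- Sketch: update only price and _ts instead of UPDATE SET *
MERGE INTO tableName t
USING (SELECT 1 AS id, 25.0 AS price, 1000 AS _ts) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET price = s.price, _ts = s._ts;
```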
### Delete From
You can remove data from a Hudi table using the `DELETE FROM` statement.
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index f96ddca3bb6..5e7e3a45089 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -38,7 +38,7 @@ using path filters. We expect that native integration with
Spark's optimized tab
management will yield great performance benefits in those versions.
:::
-### Snapshot Query without Index Acceleration
+### Snapshot Query with Index Acceleration
In this section, we go over the various indexes and how they help with data
skipping in Hudi. We will first create
a Hudi table without any index.
diff --git a/website/releases/release-1.0.0-beta2.md
b/website/releases/release-1.0.0-beta2.md
index 698b2aa3c6e..5cb174e366f 100644
--- a/website/releases/release-1.0.0-beta2.md
+++ b/website/releases/release-1.0.0-beta2.md
@@ -1,6 +1,6 @@
---
title: "Release 1.0.0-beta2"
-sidebar_position: 1
+sidebar_position: 3
layout: releases
toc: true
---
diff --git a/website/releases/release-1.0.0.md
b/website/releases/release-1.0.0.md
new file mode 100644
index 00000000000..f4d309b517a
--- /dev/null
+++ b/website/releases/release-1.0.0.md
@@ -0,0 +1,172 @@
+---
+title: "Release 1.0.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0](https://github.com/apache/hudi/releases/tag/release-1.0.0)
([docs](/docs/quick-start-guide))
+
+Apache Hudi 1.0.0 is a major milestone release. It
contains significant format changes and exciting new features,
+as we will see below.
+
+## Migration Guide
+
+We encourage users to try the **1.0.0** features on new tables first. The 1.0
general availability (GA) release will
+support automatic table upgrades from 0.x versions while also ensuring full
backward compatibility when reading 0.x
+Hudi tables using 1.0, ensuring a seamless migration experience.
+
+This release comes with **backward compatible writes**, i.e., 1.0.0 can write in
both the table version 8 (latest) and older
+table version 6 (corresponds to 0.14 & above) formats. Automatic upgrades for
tables from 0.x versions are fully
+supported, minimizing migration challenges. Until all the readers are
upgraded, users can still deploy 1.0.0 binaries
+for the writers and leverage backward compatible writes to continue writing
the tables in the older format. Once the readers
+are fully upgraded, users can switch to the latest format through a config
change. We recommend that users follow the upgrade
+steps mentioned in the [migration guide](/docs/deployment#upgrading-to-100) to
ensure a smooth transition.
+
+:::caution
+Most steps are handled seamlessly by the auto-upgrade process, but there are
some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate.
Please check the [migration guide](/docs/deployment#upgrading-to-100)
+and
[RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
for more details.
+:::
+
+## Bundle Updates
+
+ - Same bundles supported in the [0.15.0
release](release-0.15.0#new-spark-bundles) are still supported.
+ - New Flink Bundles to support Flink 1.19 and Flink 1.20.
 - This release deprecates support for Spark 3.2 and lower
versions of Spark 3.
+
+## Highlights
+
+### Format changes
+
+The main epic covering all the format changes is
[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
+covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The
following are the main highlights with respect to format changes:
+
+#### Timeline
+
+- The active/archived timeline dichotomy has been replaced with a more
scalable LSM-tree-based
+ timeline. The timeline layout is now more organized and efficient for
time-range queries and scaling to infinite history.
+- As a result, timeline layout has been changed, and it has been moved to
`.hoodie/timeline` directory under the base
+ path of the table.
+- There are changes to the timeline instant files as well:
+ - All commit metadata is serialized to Avro, allowing for future
compatibility and uniformity in metadata across all
+ actions.
+ - Instant files for completed actions now include a completion time.
+ - The action for a pending clustering instant has been renamed to `clustering`
to distinguish it from other
+ `replacecommit` actions.
+
+#### Log File Format
+
+- In addition to the keys in the log file header, we also store record
positions. Refer to the
+ latest [spec](/tech-specs-1point0#log-format) for more details. This allows
us to do position-based merging (apart
+ from key-based merging) and skip pages based on positions.
+- Log file names now contain the deltacommit instant time instead of the base
commit instant time.
+- The new log file format also enables fast partial updates with low storage
overhead.
+
+### Compatibility with Old Formats
+
+- **Backward Compatible writes:** Hudi 1.0 writes now support writing in both
the table version 8 (latest) and older table version 6 (corresponds to 0.14 &
above) formats, ensuring seamless
+ integration with existing setups.
+- **Automatic upgrades:** Upgrades for tables from 0.x versions are fully
supported, minimizing migration challenges. If you have advanced setups with
multiple readers/writers/table
+ services, we recommend first migrating to 0.14 or above.
+
+### Concurrency Control
+
+1.0.0 introduces **Non-Blocking Concurrency Control (NBCC)**, enabling
multi-stream concurrent ingestion without
conflict. This is a general-purpose concurrency model aimed at stream
processing or high-contention/frequent-writing
+scenarios. In contrast to Optimistic Concurrency Control, where writers abort
the transaction if there is a hint of
+contention, this innovation allows multiple streaming writes to the same Hudi
table without any overhead of conflict
+resolution, while keeping the semantics of event-time ordering found in
streaming systems, along with asynchronous table
+services such as compaction, archiving and cleaning.
+
+To learn more about NBCC, refer to [this
blog](/blog/2024/12/06/non-blocking-concurrency-control) which also includes a
demo with Flink writers.
+
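As a sketch, NBCC is enabled per writer via the concurrency-mode config. Shown here as Flink SQL table options in the style of the linked blog's demo; the table schema and path are illustrative, and the option key should be verified against your Flink bundle:

```sql
-- Sketch (Flink SQL): a streaming writer on a MOR table with NBCC enabled
CREATE TABLE hudi_table (
  id INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
```

A second streaming job with the same table definition could then ingest into the same table concurrently, with merging deferred to the compactor.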
+### New Indices
+
+1.0.0 introduces new indices to the multi-modal indexing subsystem of Apache
Hudi. These indices are designed to improve
+query performance through partition pruning and further data skipping.
+
+#### Secondary Index
+
+The **secondary index** allows users to create indexes on columns that are not
part of record key columns in Hudi
+tables. It can be used to speed up queries with predicates on columns other
than record key columns.
+
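A minimal sketch of the SQL, following the prerequisites in the DDL docs (record index enabled first; `uuid` as the record key and `city` as a non-key column are illustrative):

```sql
-- Sketch: enable record index, then create a secondary index on a non-key column
SET hoodie.metadata.record.index.enable = true;
CREATE INDEX record_index ON hudi_table (uuid);
CREATE INDEX idx_city ON hudi_table (city);
```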
+#### Partition Stats Index
+
+The **partition stats index** aggregates statistics at the partition level for
the columns for which it is enabled. This
+helps in efficient partition pruning even for non-partition fields.
+
+#### Expression Index
+
+The **expression index** enables efficient queries on columns derived from
expressions. It can collect stats on columns
+derived from expressions without materializing them, and can be used to speed
up queries with filters containing such
+expressions.
+
+To learn more about these indices, refer to the [SQL
queries](/docs/sql_queries#snapshot-query-with-index-acceleration) docs.
+
+### Partial Updates
+
+1.0.0 extends support for partial updates to Merge-on-Read tables, which
allows users to update only a subset of columns
+in a record. This is useful when only a few columns change, avoiding a rewrite
of the
+entire record.
+
+To learn more about partial updates, refer to the [SQL
DML](/docs/sql_dml#merge-into-partial-update) docs.
+
+### Multiple Base File Formats in a single table
+
+- Support for multiple base file formats (e.g., **Parquet**, **ORC**,
**HFile**) within a single Hudi table, allowing
+ tailored formats for specific use cases like indexing and ML applications.
+- It is also useful when users want to switch from one file
+ format to another, e.g. from ORC to Parquet, without rewriting the whole
table.
+- **Configuration:** Enable with
`hoodie.table.multiple.base.file.formats.enable`.
+
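A sketch of enabling this at table creation, using the property named above (schema is illustrative):

```sql
-- Sketch: allow multiple base file formats within one table
CREATE TABLE hudi_multi_format (id INT, name STRING) USING hudi
TBLPROPERTIES (
  'hoodie.table.multiple.base.file.formats.enable' = 'true'
);
```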
+To learn more about the format changes, refer to the [Hudi 1.0 tech
specification](/tech-specs-1point0).
+
+### API Changes
+
+1.0.0 introduces several API changes, including:
+
+#### Record Merger API
+
+The `HoodieRecordPayload` interface is deprecated in favor of the new
`HoodieRecordMerger` interface. Record merger is a
+generic interface that allows users to define custom logic for merging base
file and log file records. This release
+comes with a few out-of-the-box merge modes, which define how the base and log
files are ordered in a file slice and
+further how different records with the same record key within that file slice
are merged consistently to produce the
+same deterministic results for snapshot queries, writers and table services.
Specifically, there are three merge modes
+supported as a table-level configuration:
+
+- `COMMIT_TIME_ORDERING`: Merging simply picks the record belonging to the
latest write (commit time) as the merged
+ result.
+- `EVENT_TIME_ORDERING`: Merging picks the record with the highest value on a
user-specified ordering or precombine
+ field as the merged result.
+- `CUSTOM`: Users can provide custom merger implementation to have better
control over the merge logic.
+
+:::note
+Going forward, we recommend that users migrate to the record merger APIs
rather than writing new payload implementations.
+:::
+
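As a sketch, the merge mode is set as a table-level config at creation time. The config key is assumed to be `hoodie.record.merge.mode`; verify it and the table options against the configuration reference for your version:

```sql
-- Sketch: event-time ordering, merging on the ts precombine field
CREATE TABLE hudi_mor_table (
  id INT, name STRING, ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  'hoodie.record.merge.mode' = 'EVENT_TIME_ORDERING'
);
```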
+#### Positional Merging with Filegroup Reader
+
+- **Position-Based Merging:** Offers an alternative to key-based merging,
allowing for page skipping based on record
+ positions. Enabled by default for Spark and Hive.
+- **Configuration:** Activate positional merging using
`hoodie.merge.use.record.positions=true`.
+
+The new reader has shown impressive performance gains for **partial updates**
with key-based merging. For a MOR table of
+size 1TB with 100 partitions and 80% random updates in subsequent commits, the
new reader is **5.7x faster** for
+snapshot queries with **70x reduced write amplification**.
+
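Positional merging can be toggled with the config mentioned above, e.g. as a session setting (a sketch; it may also be passed as a read/write option):

```sql
-- Sketch: enable position-based merging for the filegroup reader
SET hoodie.merge.use.record.positions = true;
```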
+### Flink Enhancements
+
+- **Lookup Joins:** Flink now supports lookup joins, enabling table enrichment
with external data sources.
+- **Partition Stats Index Support:** As mentioned above, partition stats
support is now available for Flink, bringing
+ efficient partition pruning to streaming workloads.
+- **Non-Blocking Concurrency Control:** NBCC is now available for Flink
streaming writers, allowing for multi-stream
+ concurrent ingestion without conflict.
+
+## Call to Action
+
+The 1.0.0 GA release is the culmination of extensive development, testing, and
feedback. We invite you to upgrade and
+experience the new features and enhancements.