This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7529570682e [DOCS] Add release highlights for 1.0.0 release (#12475)
7529570682e is described below
commit 7529570682e111af847fd7d6f968abff797df355
Author: Sagar Sumit <[email protected]>
AuthorDate: Fri Dec 13 09:07:02 2024 +0530
[DOCS] Add release highlights for 1.0.0 release (#12475)
* Add release highlights for 1.0.0 release
* Code Review comments for release-1.0.0.md
* Fix links and address review comments wrt upgrading
* Add limitations
---------
Co-authored-by: vinoth chandar <[email protected]>
---
website/docs/concurrency_control.md | 5 +
website/docs/deployment.md | 26 +++++
website/docs/sql_ddl.md | 38 ++++---
website/docs/sql_dml.md | 8 ++
website/docs/sql_queries.md | 2 +-
website/releases/release-1.0.0-beta2.md | 2 +-
website/releases/release-1.0.0.md | 172 ++++++++++++++++++++++++++++++++
7 files changed, 230 insertions(+), 23 deletions(-)
diff --git a/website/docs/concurrency_control.md
b/website/docs/concurrency_control.md
index e14bd1c8206..549f1ddd17e 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -214,6 +214,11 @@ currently available for preview in version 1.0.0-beta only
with the caveat that
between clustering and ingestion. It works for compaction and ingestion, and
we can see an example of that with Flink
writers [here](sql_dml#non-blocking-concurrency-control-experimental).
+:::note
+`NON_BLOCKING_CONCURRENCY_CONTROL` between an ingestion writer and a table service
writer is not yet supported for clustering.
+Please use `OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
+:::
+
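To stay on OCC for clustering while the above limitation holds, writers typically set the concurrency mode and a lock provider explicitly. Below is a minimal sketch using Spark SQL table properties, assuming a table named `hudi_table` and a single-JVM deployment where the in-process lock provider suffices (verify the config names against your Hudi version):

```sql
-- Sketch: keep clustering on OCC with an explicit lock provider
ALTER TABLE hudi_table SET TBLPROPERTIES (
  'hoodie.write.concurrency.mode' = 'OPTIMISTIC_CONCURRENCY_CONTROL',
  'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.InProcessLockProvider'
);
```

For multi-process deployments, a distributed lock provider (e.g. ZooKeeper or Hive Metastore based) would be required instead.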
## Early Conflict Detection
Multi-writer support using OCC allows multiple writers to concurrently write and
atomically commit to the Hudi table if there is no overlapping data file to be
written, guaranteeing data consistency, integrity and correctness. Prior to the
0.13.0 release, as the OCC (optimistic concurrency control) name suggests, each
writer optimistically proceeds with ingestion and, towards the end, just
before committing, goes through a conflict resolution flow to deduce overlapping
writes and abort one if need [...]
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 3e572867e79..1c5a41a0acb 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -165,6 +165,32 @@ As general guidelines,
Note that release notes can override this information with specific
instructions, applicable on a case-by-case basis.
+### Upgrading to 1.0.0
+
+1.0.0 is a major release with significant format changes. To ensure a smooth
migration experience, we recommend the
+following steps:
+
+1. Stop any async table services in 0.x completely.
+2. Upgrade writers to 1.x with table version (tv) 6, `autoUpgrade` and
metadata disabled (this won't auto-upgrade anything);
+ 0.x readers will continue to work; writers can also be readers and will
continue to read tv=6 tables.
+ a. Set `hoodie.write.auto.upgrade` to false.
+ b. Set `hoodie.metadata.enable` to false.
+3. Upgrade table services to 1.x with tv=6, and resume operations.
+4. Upgrade all remaining readers to 1.x, with tv=6.
+5. Redeploy writers with tv=8; table services and readers will adapt/pick up
tv=8 on the fly.
+6. Once all readers and writers are in 1.x, we are good to enable any new
features, including metadata, with 1.x tables.
+
+During the upgrade, the metadata table will not be updated and will lag behind
the data table. It is important to note
+that the metadata table is updated only once the writer is upgraded to tv=8.
So, even readers should keep metadata
+disabled during the rolling upgrade until all writers are upgraded to tv=8.
+
+:::caution
+Most steps are handled seamlessly by the auto-upgrade process, but there are
some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate.
Please
+check
[RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
+for more details.
+:::
+
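The writer-side settings in step 2 above can be sketched as Spark SQL session configs. This is only a sketch; depending on your writer, these may instead be passed as DataSource write options or writer properties:

```sql
-- Sketch of step 2: run 1.x binaries while holding the table at tv=6
SET hoodie.write.auto.upgrade = false;  -- do not auto-upgrade the table version
SET hoodie.metadata.enable = false;     -- keep the metadata table disabled during the rolling upgrade
```

Both configs are reverted (auto-upgrade re-enabled, metadata re-enabled) only in steps 5 and 6, once all readers and writers run 1.x.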
## Downgrading
Upgrade is automatic whenever a new Hudi version is used whereas downgrade is
a manual step. We need to use the Hudi
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index c00b815ac4e..ed4b151c754 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -272,6 +272,7 @@ Both index and column on which the index is created can be
qualified with some o
Please note in order to create secondary index:
1. The table must have a primary key and merge mode should be
[COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
2. Record index must be enabled. This can be done by setting
`hoodie.metadata.record.index.enable=true` and then creating `record_index`.
Please note the example below.
+3. Secondary index is not supported for [complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
:::
**Examples**
@@ -334,12 +335,18 @@ date based partitioning, provide same benefits to
queries, even if the physical
CREATE INDEX IF NOT EXISTS ts_datestr ON hudi_table
USING column_stats(ts)
OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');
--- Create a expression index on the column `ts` (timestamp in yyyy-MM-dd
HH:mm:ss) of the table `hudi_table` using the function `hour`
+-- Create an expression index on the column `ts` (timestamp in yyyy-MM-dd
HH:mm:ss) of the table `hudi_table` using the function `hour`
CREATE INDEX ts_hour ON hudi_table
USING column_stats(ts)
options(expr='hour');
```
+:::note
+1. An expression index can only be created via SQL on the Spark engine. It is not
yet supported with the Spark DataSource API.
+2. Expression index is not yet supported for [complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+3. Expression index is supported for unary and certain binary expressions.
Please check the [SQL DDL docs](sql_ddl#create-expression-index) for more details.
+:::
+
The `expr` option is required for creating expression index, and it should be
a valid Spark SQL function. Please check the syntax
for the above functions in the [Spark SQL
documentation](https://spark.apache.org/docs/latest/sql-ref-functions.html) and
provide the options accordingly. For example,
the `format` option is required for `from_unixtime` function.
@@ -434,6 +441,12 @@ and execution.
To enable partition stats index, simply set
`hoodie.metadata.index.partition.stats.enable = 'true'` in create table options.
+:::note
+1. The `column_stats` index must be enabled for the `partition_stats` index; the
two go hand in hand.
+2. The `partition_stats` index is not created automatically for all columns.
Users must specify the list of columns for which they want to create the
partition stats index.
+3. The `column_stats` and `partition_stats` indexes are not yet supported for
[complex
types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+:::
+
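The note above can be illustrated with a sketch of create-table options that enable both indexes together. The table and column names are illustrative, and `hoodie.metadata.index.column.stats.column.list` is assumed to be the column-list config; verify the exact keys against the configuration reference for your version:

```sql
-- Sketch: enable column stats and partition stats for selected columns
CREATE TABLE hudi_table_with_stats (
  id INT, name STRING, price DOUBLE, ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id',
  'hoodie.metadata.index.column.stats.enable' = 'true',
  'hoodie.metadata.index.partition.stats.enable' = 'true',
  'hoodie.metadata.index.column.stats.column.list' = 'price,ts'
);
```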
### Create Secondary Index
Secondary indexes are record level indexes built on any column in the table.
It supports multiple records having the same
@@ -441,11 +454,8 @@ secondary column value efficiently and is built on top of
the existing record le
Secondary indexes are hash based indexes that offer horizontally scalable
write performance by splitting key space into shards
by hashing, as well as fast lookups by employing row-based file formats.
-:::note
-Please note in order to create secondary index:
-1. The table must have a primary key and merge mode should be
[COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
-2. Record index must be enabled. This can be done by setting
`hoodie.metadata.record.index.enable=true` and then creating `record_index`.
Please note the example below.
-:::
+Let us now look at an example of creating a table with multiple indexes and
how queries leverage the indexes for both
+partition pruning and data skipping.
```sql
DROP TABLE IF EXISTS hudi_table;
@@ -513,24 +523,10 @@ Bloom filter indexes store a bloom filter per file, on
the column or column expr
effective in skipping files that don't contain a high cardinality column value
e.g. uuids.
```sql
-CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING
bloom_filters(driver) OPTIONS(expr='identity');
+-- Create a bloom filter index on the column derived from expression
`lower(rider)` of the table `hudi_indexed_table`
CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider)
OPTIONS(expr='lower');
```
-
-### Limitations
-
-- Unlike column stats, partition stats index is not created automatically for
all columns. Users must specify list of
- columns for which they want to create partition stats index.
-- Predicate on internal meta fields such as `_hoodie_record_key` or
`_hoodie_partition_path` cannot be used for data
- skipping. Queries with such predicates cannot leverage the indexes.
-- Secondary index is not supported for nested fields.
-- Secondary index can be created only if record index is available in the table
-- Secondary index can only be used for tables using
OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode
-- Column stats Expression Index can not be created using `identity` expression
with SQL. Users can leverage column stat index using Datasource instead.
-- Index update can fail with schema evolution.
-- Only one index can be created at a time using [async
indexer](metadata_indexing).
-
### Setting Hudi configs
There are different ways you can pass the configs for a given hudi table.
diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index 6f5fe28a3eb..8b4200154ec 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -212,6 +212,14 @@ SELECT id, name, price, _ts, description FROM tableName;
Notice, instead of `UPDATE SET *`, we are updating only the `price` and `_ts`
columns.
+:::note
+Partial update is not yet supported in the following cases:
+1. When the target table is a bootstrapped table.
+2. When virtual keys are enabled.
+3. When schema on read is enabled.
+4. When there is an enum field in the source data.
+:::
+
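For reference, the partial-update form discussed above looks like the following sketch, reusing the `tableName`, `price` and `_ts` columns from the example; the source subquery is illustrative:

```sql
-- Sketch: update only price and _ts instead of UPDATE SET *
MERGE INTO tableName t
USING (SELECT 1 AS id, 25.0 AS price, 1000 AS _ts) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET price = s.price, _ts = s._ts;
```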
### Delete From
You can remove data from a Hudi table using the `DELETE FROM` statement.
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index f96ddca3bb6..5e7e3a45089 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -38,7 +38,7 @@ using path filters. We expect that native integration with
Spark's optimized tab
management will yield great performance benefits in those versions.
:::
-### Snapshot Query without Index Acceleration
+### Snapshot Query with Index Acceleration
In this section, we go over the various indexes and how they help with data
skipping in Hudi. We will first create
a Hudi table without any index.
diff --git a/website/releases/release-1.0.0-beta2.md
b/website/releases/release-1.0.0-beta2.md
index 698b2aa3c6e..5cb174e366f 100644
--- a/website/releases/release-1.0.0-beta2.md
+++ b/website/releases/release-1.0.0-beta2.md
@@ -1,6 +1,6 @@
---
title: "Release 1.0.0-beta2"
-sidebar_position: 1
+sidebar_position: 3
layout: releases
toc: true
---
diff --git a/website/releases/release-1.0.0.md
b/website/releases/release-1.0.0.md
new file mode 100644
index 00000000000..f4d309b517a
--- /dev/null
+++ b/website/releases/release-1.0.0.md
@@ -0,0 +1,172 @@
+---
+title: "Release 1.0.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0](https://github.com/apache/hudi/releases/tag/release-1.0.0)
([docs](/docs/quick-start-guide))
+
+Apache Hudi 1.0.0 is a major milestone release. It
contains significant format changes and exciting new features,
+as we will see below.
+
+## Migration Guide
+
+We encourage users to try the **1.0.0** features on new tables first. The 1.0
general availability (GA) release will
+support automatic table upgrades from 0.x versions while also ensuring full
backward compatibility when reading 0.x
+Hudi tables using 1.0, ensuring a seamless migration experience.
+
+This release comes with **backward compatible writes**, i.e., 1.0.0 can write in
both the table version 8 (latest) and older
+table version 6 (corresponds to 0.14 & above) formats. Automatic upgrades for
tables from 0.x versions are fully
+supported, minimizing migration challenges. Until all the readers are
upgraded, users can still deploy 1.0.0 binaries
+for the writers and leverage backward compatible writes to continue writing
the tables in the older format. Once the readers
+are fully upgraded, users can switch to the latest format through a config
change. We recommend that users follow the upgrade
+steps mentioned in the [migration guide](/docs/deployment#upgrading-to-100) to
ensure a smooth transition.
+
+:::caution
+Most steps are handled seamlessly by the auto-upgrade process, but there are
some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate.
Please check the [migration guide](/docs/deployment#upgrading-to-100)
+and
[RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
for more details.
+:::
+
+## Bundle Updates
+
+ - Same bundles supported in the [0.15.0
release](release-0.15.0#new-spark-bundles) are still supported.
+ - New Flink Bundles to support Flink 1.19 and Flink 1.20.
 - This release deprecates support for Spark 3.2 and lower
versions of Spark 3.
+
+## Highlights
+
+### Format changes
+
+The main epic covering all the format changes is
[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
+covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The
following are the main highlights with respect to format changes:
+
+#### Timeline
+
+- The active/archived timeline dichotomy has been replaced with a more
scalable LSM-tree-based
+ timeline. The timeline layout is now more organized and efficient for
time-range queries and scaling to infinite history.
+- As a result, timeline layout has been changed, and it has been moved to
`.hoodie/timeline` directory under the base
+ path of the table.
+- There are changes to the timeline instant files as well:
+ - All commit metadata is serialized to Avro, allowing for future
compatibility and uniformity in metadata across all
+ actions.
+ - Instant files for completed actions now include a completion time.
+ - The action for a pending clustering instant has been renamed to `clustering`
to distinguish it from other
+ `replacecommit` actions.
+
+#### Log File Format
+
+- In addition to the keys in the log file header, we also store record
positions. Refer to the
+ latest [spec](/tech-specs-1point0#log-format) for more details. This allows
us to do position-based merging (apart
+ from key-based merging) and skip pages based on positions.
+- Log file names now contain the deltacommit instant time instead of the base
commit instant time.
+- The new log file format also enables fast partial updates with low storage
overhead.
+
+### Compatibility with Old Formats
+
+- **Backward Compatible writes:** Hudi 1.0 writes now support writing in both
the table version 8 (latest) and older table version 6 (corresponds to 0.14 &
above) formats, ensuring seamless
+ integration with existing setups.
+- **Automatic upgrades:** Upgrades for tables from 0.x versions are fully
supported, minimizing migration challenges. If you have advanced setups with
multiple readers/writers/table
+ services, we recommend first migrating to 0.14 or above.
+
+### Concurrency Control
+
+1.0.0 introduces **Non-Blocking Concurrency Control (NBCC)**, enabling
multi-stream concurrent ingestion without
conflict. This is a general-purpose concurrency model aimed at stream
processing or high-contention/frequent-writing
+scenarios. In contrast to Optimistic Concurrency Control, where writers abort
the transaction if there is a hint of
+contention, this innovation allows multiple streaming writes to the same Hudi
table without any overhead of conflict
+resolution, while keeping the semantics of event-time ordering found in
streaming systems, along with asynchronous table
+services such as compaction, archiving and cleaning.
+
+To learn more about NBCC, refer to [this
blog](/blog/2024/12/06/non-blocking-concurrency-control) which also includes a
demo with Flink writers.
+
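As a sketch, NBCC is enabled per writer via the concurrency-mode config. Shown here as Flink SQL table options in the style of the linked blog's demo; the table schema and path are illustrative, and the option key should be verified against your Flink bundle:

```sql
-- Sketch (Flink SQL): a streaming writer on a MOR table with NBCC enabled
CREATE TABLE hudi_table (
  id INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
```

A second streaming job with the same table definition could then ingest into the same table concurrently, with merging deferred to the compactor.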
+### New Indices
+
+1.0.0 introduces new indices to the multi-modal indexing subsystem of Apache
Hudi. These indices are designed to improve
+query performance through partition pruning and further data skipping.
+
+#### Secondary Index
+
+The **secondary index** allows users to create indexes on columns that are not
part of record key columns in Hudi
+tables. It can be used to speed up queries with predicates on columns other
than record key columns.
+
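A minimal sketch of the SQL, following the prerequisites in the DDL docs (record index enabled first; `uuid` as the record key and `city` as a non-key column are illustrative):

```sql
-- Sketch: enable record index, then create a secondary index on a non-key column
SET hoodie.metadata.record.index.enable = true;
CREATE INDEX record_index ON hudi_table (uuid);
CREATE INDEX idx_city ON hudi_table (city);
```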
+#### Partition Stats Index
+
+The **partition stats index** aggregates statistics at the partition level for
the columns for which it is enabled. This
+helps in efficient partition pruning even for non-partition fields.
+
+#### Expression Index
+
+The **expression index** enables efficient queries on columns derived from
expressions. It can collect stats on columns
+derived from expressions without materializing them, and can be used to speed
up queries with filters containing such
+expressions.
+
+To learn more about these indices, refer to the [SQL
queries](/docs/sql_queries#snapshot-query-with-index-acceleration) docs.
+
+### Partial Updates
+
+1.0.0 extends support for partial updates to Merge-on-Read tables, which
allows users to update only a subset of columns
+in a record. This is useful when only a few columns change, avoiding a rewrite
of the
+entire record.
+
+To learn more about partial updates, refer to the [SQL
DML](/docs/sql_dml#merge-into-partial-update) docs.
+
+### Multiple Base File Formats in a single table
+
+- Support for multiple base file formats (e.g., **Parquet**, **ORC**,
**HFile**) within a single Hudi table, allowing
+ tailored formats for specific use cases like indexing and ML applications.
+- It is also useful when users want to switch from one file
+ format to another, e.g. from ORC to Parquet, without rewriting the whole
table.
+- **Configuration:** Enable with
`hoodie.table.multiple.base.file.formats.enable`.
+
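A sketch of enabling this at table creation, using the property named above (schema is illustrative):

```sql
-- Sketch: allow multiple base file formats within one table
CREATE TABLE hudi_multi_format (id INT, name STRING) USING hudi
TBLPROPERTIES (
  'hoodie.table.multiple.base.file.formats.enable' = 'true'
);
```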
+To learn more about the format changes, refer to the [Hudi 1.0 tech
specification](/tech-specs-1point0).
+
+### API Changes
+
+1.0.0 introduces several API changes, including:
+
+#### Record Merger API
+
+The `HoodieRecordPayload` interface is deprecated in favor of the new
`HoodieRecordMerger` interface. Record merger is a
+generic interface that allows users to define custom logic for merging base
file and log file records. This release
+comes with a few out-of-the-box merge modes, which define how the base and log
files are ordered in a file slice and
+further how different records with the same record key within that file slice
are merged consistently to produce the
+same deterministic results for snapshot queries, writers and table services.
Specifically, there are three merge modes
+supported as a table-level configuration:
+
+- `COMMIT_TIME_ORDERING`: Merging simply picks the record belonging to the
latest write (commit time) as the merged
+ result.
+- `EVENT_TIME_ORDERING`: Merging picks the record with the highest value on a
user-specified ordering or precombine
+ field as the merged result.
+- `CUSTOM`: Users can provide custom merger implementation to have better
control over the merge logic.
+
+:::note
+Going forward, we recommend that users migrate to the record merger APIs
rather than writing new payload implementations.
+:::
+
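As a sketch, the merge mode is set as a table-level config at creation time. The config key is assumed to be `hoodie.record.merge.mode`; verify it and the table options against the configuration reference for your version:

```sql
-- Sketch: event-time ordering, merging on the ts precombine field
CREATE TABLE hudi_mor_table (
  id INT, name STRING, ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  'hoodie.record.merge.mode' = 'EVENT_TIME_ORDERING'
);
```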
+#### Positional Merging with Filegroup Reader
+
+- **Position-Based Merging:** Offers an alternative to key-based merging,
allowing for page skipping based on record
+ positions. Enabled by default for Spark and Hive.
+- **Configuration:** Activate positional merging using
`hoodie.merge.use.record.positions=true`.
+
+The new reader has shown impressive performance gains for **partial updates**
with key-based merging. For a MOR table of
+size 1TB with 100 partitions and 80% random updates in subsequent commits, the
new reader is **5.7x faster** for
+snapshot queries with **70x reduced write amplification**.
+
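Positional merging can be toggled with the config mentioned above, e.g. as a session setting (a sketch; it may also be passed as a read/write option):

```sql
-- Sketch: enable position-based merging for the filegroup reader
SET hoodie.merge.use.record.positions = true;
```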
+### Flink Enhancements
+
+- **Lookup Joins:** Flink now supports lookup joins, enabling table enrichment
with external data sources.
+- **Partition Stats Index Support:** As mentioned above, partition stats
support is now available for Flink, bringing
+ efficient partition pruning to streaming workloads.
+- **Non-Blocking Concurrency Control:** NBCC is now available for Flink
streaming writers, allowing for multi-stream
+ concurrent ingestion without conflict.
+
+## Call to Action
+
+The 1.0.0 GA release is the culmination of extensive development, testing, and
feedback. We invite you to upgrade and
+experience the new features and enhancements.