nsivabalan commented on code in PR #11618:
URL: https://github.com/apache/hudi/pull/11618#discussion_r1676856952

##########
website/releases/release-1.0.0-beta2.md:
##########
@@ -0,0 +1,80 @@
+---
+title: "Release 1.0.0-beta2"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0-beta2](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta2) ([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta2 is the second beta release of Apache Hudi. This release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for production use.
+
+## Migration Guide
+
+This release contains major format changes as we will see in highlights below. We encourage users to try out the
+**1.0.0-beta2** features on new tables. The 1.0 general availability (GA) release will support automatic table upgrades
+from 0.x versions, while also ensuring full backward compatibility when reading 0.x Hudi tables using 1.0, ensuring a
+seamless migration experience.
+
+:::caution
+Given that timeline format and log file format has changed in this **beta release**, it is recommended not to attempt to do
+rolling upgrades from older versions to this release.
+:::
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic covering all the format changes proposals,
+which are also partly covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main
+changes in this release:
+
+#### Timeline
+
+No major changes in this release. Refer to [1.0.0-beta1#timeline](release-1.0.0-beta1.md#timeline) for more details.
+
+#### Log File Format
+
+In addition to the fields in the log file header added in [1.0.0-beta1](release-1.0.0-beta1.md#log-file-format), we also
+store a flag, `IS_PARTIAL` to indicate whether the log block contains partial updates or not.
+
+### Metadata indexes
+
+In 1.0.0-beta1, we added support for functional index. In 1.0.0-beta2, we have added support for secondary indexes and
+partition stats index to the [multi-modal indexing](/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) subsystem.
+
+#### Secondary Indexes
+
+Secondary indexes allow users to create indexes on columns that are not part of record key columns in Hudi tables (for
+record key fields, Hudi supports [Record-level Index](/blog/2023/11/01/record-level-index). Secondary indexes can be used to speed up
+queries with predicate on columns other than record key columns.
+
+#### Partition Stats Index
+
+Partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This helps
+in efficient partition pruning even for non-partition fields.
+
+To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-index).
+
+### API Changes
+
+#### Positional Merging
+
+In 1.0.0-beta1, we added a new [filegroup reader](/releases/release-1.0.0-beta1#new-filegroup-reader). The reader now
+provides position-based merging, as an alternative to existing key-based merging, and skipping pages based on record
+positions. The new filegroup reader is integrated with Spark and Hive, and enabled by default. To enable positional
+merging set below configs:
+
+```properties

Review Comment:
   not related to this doc PR. curious in general. if we have fallback mechanism to do key based merges if positional based merges are not possible, why not we enable this by default?


##########
website/docs/metadata.md:
##########
@@ -90,6 +90,32 @@ Following are the different indices currently available under the metadata table
   Hudi release, this index aids in locating records faster than other existing indices and can provide a speedup orders of magnitude faster in large deployments where index lookup dominates write latencies.
+#### New Indexes in 1.0.0
+
+- ***Functional Index***:
+  A [functional index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md)
+  is an index on a function of a column. If a query has a predicate on a function of a column, the functional index can
+  be used to speed up the query. Functional index is stored in *func_index_* prefixed partitions (one for each
+  function) under metadata table. Functional index can be created using SQL syntax. Please checkout SQL DDL
+  docs [here](/docs/next/sql_ddl#create-functional-index) for more details.
+
+- ***Partition Stats Index***
+  Partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This helps
+  in efficient partition pruning even for non-partition fields. The partition stats index is stored in *partition_stats*
+  partition under metadata table. Partition stats index can be enabled using the following configs (note it is required
+  to specify the columns for which stats should be aggregated):
+  ```properties
+  hoodie.metadata.index.partition.stats.enable=true
+  hoodie.metadata.index.column.stats.columns=<comma-separated-column-names>
+  ```
+
+- ***Secondary Index***:
+  Secondary indexes allow users to create indexes on columns that are not part of record key columns in Hudi tables (for
+  record key fields, Hudi supports [Record-level Index](/blog/2023/11/01/record-level-index). Secondary indexes
+  can be used to speed up queries with predicate on columns other than record key columns.
+
+To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-index).

Review Comment:
   don't we have a separate section for sec index? this is referring to partition stats index?


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
