codope commented on code in PR #10093:
URL: https://github.com/apache/hudi/pull/10093#discussion_r1393592818
##########
website/releases/release-1.0.0-beta1.md:
##########
@@ -0,0 +1,121 @@
+---
+title: "Release 1.0.0-beta1"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0-beta1](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta1) ([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta1 is the first beta release of Apache Hudi. This release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for production use.
+
+## Migration Guide
+
+This release contains major format changes as we will see in highlights below. As such, migration would be required when
+the release is made generally available (GA). However, we encourage users to try out the features on new tables.
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade instructions from each older release in
+sequence.
+:::
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic covering all the format changes proposals. The following are the main changes in this release:
+
+#### Timeline
+
+- Now all commit metadata is serialized to avro. This allows us to add new fields in the future without breaking
+  compatibility and also maintain uniformity in metadata across all actions.
+- All completed commit metadata file name will also have completion time. All the actions in requested/inflight states
+  are stored in the active timeline as files named <begin_instant_time>.<action_type>.<requested|inflight>. Completed
+  actions are stored along with a time that denotes when the action was completed, in a file named <
+  begin_instant_time>_<completion_instant_time>.<action_type>. This allows us to implement file slicing for non-blocking
+  concurrecy control.
+- Completed actions, their plans and completion metadata are stored in a more
+  scalable [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) based timeline organized in an *
+  *_archived_** storage location under the .hoodie metadata path. It consists of Apache Parquet files with action
+  instant data and bookkeeping metadata files, in the following manner. Checkout [timeline](/docs/next/timeline) docs for more details.
+
+#### Log File Format
+
+- Now in addition to the fields in the log file header mentioned in the [spec](https://hudi.apache.org/tech-specs/#log-file-format),
+  we also store the record positions in the header. This allows us to do position-based merging (apart from key-based merging) and skip pages based on positions.
+- Log file name will now have the deltacommit instant time instead of base commit instant time.
+
+#### Multiple base file formats
+
+Now you can have multiple base files formats in a Hudi table. Even the same filegroup can have multiple base file
+formats. We need to set a table config `hoodie.table.multiple.base.file.formats.enable` to use this features. And
+whenever we need to change the format, then just specify the format in the `hoodie.base.file.format"` config. Currently,
+only Parquet, Orc and HFile formats are supported.

Review Comment:
   Added `This unlocks multiple benefits including choosing file format suitable to index, and supporting emerging formats for ML/AI such as [Lance](https://github.com/lancedb/lance) format.`
