codope commented on code in PR #10093:
URL: https://github.com/apache/hudi/pull/10093#discussion_r1393179359


##########
website/releases/release-1.0.0-beta1.md:
##########
@@ -0,0 +1,121 @@
+---
+title: "Release 1.0.0-beta1"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0-beta1](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta1) ([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta1 is the first beta release of Apache Hudi. This release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for production use.
+
+## Migration Guide
+
+This release contains major format changes, as described in the highlights below. As such, migration will be required
+when the release is made generally available (GA). However, we encourage users to try out the features on new tables.
+
+:::caution
+If migrating from an older release (pre-0.14.0), please also check the upgrade instructions from each older release in
+sequence.
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic covering all the format change proposals.
+The following are the main changes in this release:
+
+#### Timeline
+
+- All commit metadata is now serialized to Avro. This allows us to add new fields in the future without breaking
+  compatibility, and also maintains uniformity in metadata across all actions.
+- File names of completed commit metadata now also include the completion time. All the actions in requested/inflight
+  states are stored in the active timeline as files named `<begin_instant_time>.<action_type>.<requested|inflight>`.
+  Completed actions are stored along with a time that denotes when the action was completed, in a file named
+  `<begin_instant_time>_<completion_instant_time>.<action_type>`. This allows us to implement file slicing for
+  non-blocking concurrency control.
+- Completed actions, their plans and completion metadata are stored in a more
+  scalable [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) based timeline organized in an
+  **_archived_** storage location under the `.hoodie` metadata path. It consists of Apache Parquet files with action
+  instant data and bookkeeping metadata files. Check out the [timeline](/docs/next/timeline) docs for more details.

Review Comment:
   add exact LSM timeline link after #10087 is merged.



##########
website/releases/release-1.0.0-beta1.md:
##########
@@ -0,0 +1,121 @@
+
+#### Log File Format
+
+- In addition to the fields in the log file header mentioned in the [spec](https://hudi.apache.org/tech-specs/#log-file-format),
+  we now also store the record positions in the header. This enables position-based merging (in addition to key-based
+  merging) and skipping pages based on positions.
+- Log file names now contain the deltacommit instant time instead of the base commit instant time.
+
+#### Multiple base file formats
+
+A Hudi table can now contain multiple base file formats; even the same file group can have base files of different
+formats. Set the table config `hoodie.table.multiple.base.file.formats.enable` to use this feature, and whenever you
+need to change the format, simply specify it in the `hoodie.base.file.format` config. Currently, only the Parquet, ORC
+and HFile formats are supported.
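As an illustrative sketch (the table schema, table name, and property placement below are assumptions on my part, not taken from the release notes), enabling the feature at table creation time in Spark SQL could look like:

```sql
-- Hypothetical table; enable multiple base file formats via table config
CREATE TABLE hudi_multi_fmt (
  id INT,
  name STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  'hoodie.table.multiple.base.file.formats.enable' = 'true'
);

-- Switch the base file format for subsequent writes (e.g. to HFile)
SET hoodie.base.file.format=hfile;
```

The `SET` placement for `hoodie.base.file.format` is an assumption; consult the configuration reference for the exact way write configs are passed in your engine.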
+
+### Concurrency Control
+
+A new concurrency control mode called `NON_BLOCKING_CONCURRENCY_CONTROL` is introduced in this release. Unlike
+optimistic concurrency control (OCC), multiple writers can operate on the table with non-blocking conflict resolution:
+the writers can write into the same file group, with conflicts resolved automatically by the query reader and the
+compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta1 only. You can read more
+about it under the section [Model C: Multi-writer](/docs/next/concurrency_control#model-c-multi-writer). A complete
+example with multiple Flink streaming writers is available
+[here](/docs/next/flink-quick-start-guide#non-blocking-concurrency-control). You can follow
+the [RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md) and
+the [JIRA](https://issues.apache.org/jira/browse/HUDI-6640) for more details.
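For a flavor of how a writer might opt in, here is a hedged Flink SQL sketch; the table definition and especially the option key are assumptions, and the linked Flink quick start has the authoritative example:

```sql
-- Hypothetical Flink SQL table opting into non-blocking concurrency control
CREATE TABLE hudi_nbcc (
  id INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_nbcc',
  'table.type' = 'MERGE_ON_READ',
  -- assumed option key for the new concurrency mode
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
```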
+
+### Functional Index
+
+A [functional index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md)
+is an index on a function of a column. It is a new addition to Hudi's
+[multi-modal indexing](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi)
+subsystem, which provides faster access methods and also absorbs partitioning into the indexing system. You can now
+create and drop an index using SQL syntax as follows:
+
+```sql
+-- Create Index
+CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name
+[USING index_type]
+(column_name1 [OPTIONS(key1=value1, key2=value2, ...)], column_name2 [OPTIONS(key1=value1, key2=value2, ...)], ...)
+[OPTIONS (key1=value1, key2=value2, ...)]
+
+-- Drop Index
+DROP INDEX [IF EXISTS] index_name ON [TABLE] table_name
+```
+
+- `index_name` is the name of the index to be created or dropped.
+- `table_name` is the name of the table on which the index is created or dropped.
+- `index_type` is the type of the index to be created. Currently, only `files`, `column_stats` and `bloom_filters` are
+  supported.
+- `column_name` is the name of the column on which the index is created.
+- Both the index and the column on which it is created can be qualified with options in the form of key-value pairs.
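As a concrete instance of the syntax above (the table, column, and option names here are hypothetical, chosen only for illustration):

```sql
-- Index the date string derived from a unix timestamp column
-- (table/column/option names are hypothetical)
CREATE INDEX IF NOT EXISTS idx_datestr ON hudi_table
USING column_stats (ts OPTIONS(func='from_unixtime', format='yyyy-MM-dd'));

-- Drop it when no longer needed
DROP INDEX IF EXISTS idx_datestr ON hudi_table;
```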
+
+To see some examples of creating and using a functional index, please check out the Spark SQL DDL
+docs [here](/docs/next/sql_ddl#create-index). You can follow

Review Comment:
   same here - dependent on #10087 



##########
website/releases/release-1.0.0-beta1.md:
##########
@@ -0,0 +1,121 @@
+### Concurrency Control
+
+A new concurrency control mode called `NON_BLOCKING_CONCURRENCY_CONTROL` is introduced in this release. Unlike
+optimistic concurrency control (OCC), multiple writers can operate on the table with non-blocking conflict resolution:
+the writers can write into the same file group, with conflicts resolved automatically by the query reader and the
+compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta1 only. You can read more
+about it under the section [Model C: Multi-writer](/docs/next/concurrency_control#model-c-multi-writer). A complete
+example with multiple Flink streaming writers is available
+[here](/docs/next/flink-quick-start-guide#non-blocking-concurrency-control). You

Review Comment:
   same here - dependent on #10087 



##########
website/releases/release-1.0.0-beta1.md:
##########
@@ -0,0 +1,121 @@
+### Concurrency Control
+
+A new concurrency control mode called `NON_BLOCKING_CONCURRENCY_CONTROL` is introduced in this release. Unlike
+optimistic concurrency control (OCC), multiple writers can operate on the table with non-blocking conflict resolution:
+the writers can write into the same file group, with conflicts resolved automatically by the query reader and the
+compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta1 only. You can read more
+about it under the section [Model C: Multi-writer](/docs/next/concurrency_control#model-c-multi-writer). A complete
+example with multiple

Review Comment:
   should be visible after #10087 is merged



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to