[GitHub] [hudi] rmahindra123 commented on a diff in pull request #9790: [WIP][DO NOT MERGE][DOCS] Add release notes for 0.14.0

via GitHub Wed, 27 Sep 2023 22:28:07 -0700


rmahindra123 commented on code in PR #9790:
URL: https://github.com/apache/hudi/pull/9790#discussion_r1339545821



##########
website/releases/release-0.14.0.md:
##########
@@ -0,0 +1,319 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) 
([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new 
functionalities and enhancements. 
+These include the introduction of Record Level Index, automatic generation of 
record keys, the `hudi_table_changes` 
+function for incremental reads, and more. Notably, this release also 
incorporates support for Spark 3.4. On the Flink 
+front, version 0.14.0 brings several exciting features such as consistent 
hashing index support, Flink 1.17 support, and 
+Update and Delete statement support. Additionally, this release upgrades the 
Hudi table version, prompting users to consult
+the Migration Guide provided below. We encourage users to review the release 
highlights (TODO Add link),
+[breaking changes](#migration-guide-breaking-changes), and [behavior 
changes](#migration-guide-behavior-changes) before 
+adopting the 0.14.0 release.
+
+
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans 
from the ".aux" folder and the introduction
+of a new log block version. As part of this release, the table version is 
updated to version `6`. When running a Hudi job 
+with version 0.14.0 on a table with an older table version, an automatic 
upgrade process is triggered to bring the table 
+up to version `6`. This upgrade is a one-time occurrence for each Hudi table, 
as the `hoodie.table.version` is updated in
+the property file upon completion of the upgrade. Additionally, a command-line 
tool for downgrading has been included, 
+allowing users to move from table version `6` to `5`, or revert from Hudi 
0.14.0 to a version prior to 0.14.0. To use this 
+tool, execute it from a 0.14.0 environment (TODO Check what is this tool). For 
more details, refer to the 

Review Comment:
   @lokeshj1703 should be able to help with getting the actual Java class (or 
similar) for the tool. I am assuming that's what you are asking for?



##########
website/releases/release-0.14.0.md:
##########
@@ -0,0 +1,319 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) 
([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new 
functionalities and enhancements. 
+These include the introduction of Record Level Index, automatic generation of 
record keys, the `hudi_table_changes` 
+function for incremental reads, and more. Notably, this release also 
incorporates support for Spark 3.4. On the Flink 
+front, version 0.14.0 brings several exciting features such as consistent 
hashing index support, Flink 1.17 support, and 
+Update and Delete statement support. Additionally, this release upgrades the 
Hudi table version, prompting users to consult
+the Migration Guide provided below. We encourage users to review the release 
highlights (TODO Add link),
+[breaking changes](#migration-guide-breaking-changes), and [behavior 
changes](#migration-guide-behavior-changes) before 
+adopting the 0.14.0 release.
+
+
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans 
from the ".aux" folder and the introduction
+of a new log block version. As part of this release, the table version is 
updated to version `6`. When running a Hudi job 
+with version 0.14.0 on a table with an older table version, an automatic 
upgrade process is triggered to bring the table 
+up to version `6`. This upgrade is a one-time occurrence for each Hudi table, 
as the `hoodie.table.version` is updated in
+the property file upon completion of the upgrade. Additionally, a command-line 
tool for downgrading has been included, 
+allowing users to move from table version `6` to `5`, or revert from Hudi 
0.14.0 to a version prior to 0.14.0. To use this 
+tool, execute it from a 0.14.0 environment (TODO Check what is this tool). For 
more details, refer to the 
+[hudi-cli](#/docs/cli/#upgrade-and-downgrade-table).
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade 
instructions from each older release in
+sequence.
+:::
+
+### Bundle Updates
+
+#### New Spark Bundles
+In this release, we've expanded our support to include bundles for both Spark 
3.4 (hudi-spark3.4-bundle_2.12) (TODO Add 
+link) and Spark 3.0 (hudi-spark3.0-bundle_2.12) (TODO Add link). Please note 
that support for Spark 3.0 had been 
+discontinued after Hudi version 0.10.1, but due to strong community interest, 
it has been reinstated in this release.
+
+### Breaking Changes
+
+#### INSERT INTO behavior with spark-sql
+Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL 
followed the upsert flow, where multiple versions 
+of records would be merged into one version. However, starting from 0.14.0, 
we've altered the default behavior of 
+`INSERT INTO` to utilize the `insert` flow internally. This change 
significantly enhances write performance as it 
+bypasses index lookups.
+
+If a table is created with a *preCombine* key, the default operation for 
`INSERT INTO` remains as `upsert`. Conversely, 
+if no *preCombine* key is set, the underlying write operation for `INSERT 
INTO` defaults to `insert`. Users have the 
+flexibility to override this behavior by explicitly setting values for the 
config 
+[`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation)
 
+as per their requirements. Possible values for this config include `insert`, 
`bulk_insert`, and `upsert`.
+
+Additionally, in version 0.14.0, we have **deprecated** two related older 
configs:
+- `hoodie.sql.insert.mode`
+- `hoodie.sql.bulk.insert.enable`.
+
+### Behavior changes
+
+#### Simplified duplicates handling with Inserts
+In cases where the operation type is configured as `insert`, users now have 
the option to enforce a duplicate policy 
+using the configuration setting 
+[`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy).
 
+This policy determines the action taken when incoming records being ingested 
already exist in storage. The available 
+values for this configuration are as follows:
+
+- `none`: No specific action is taken, allowing duplicates to exist in the 
Hudi table if the incoming records contain duplicates.
+- `drop`: Matching records from the incoming writes will be dropped, and the 
remaining ones will be ingested.
+- `fail`: The write operation will fail if the same records are re-ingested. 
In essence, a given record, as determined 
+by the key generation policy, can only be ingested once into the target table.
+
+With this addition, an older related configuration setting, 
+[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates),
 
+is now deprecated. The newer configuration will take precedence over the old 
one when both are specified. If no specific 
+configurations are provided, the default value for the newer configuration 
will be assumed. Users are strongly encouraged 
+to migrate to the use of these newer configurations. (TODO: Verify this 
behavior)
+
+
+#### Compaction with MOR table
+For Spark batch writers (both the Spark datasource and Spark SQL), compaction 
is automatically enabled by default for 
+MOR (Merge On Read) tables, unless users explicitly override this behavior. 
Users have the option to disable compaction 
+explicitly by setting 
[`hoodie.compact.inline`](https://hudi.apache.org/docs/configurations#hoodiecompactinline)
 to false. 
+In case users do not override this configuration, compaction may be triggered 
for MOR tables approximately once every 
+5 delta commits (the default value for 
+[`hoodie.compact.inline.max.delta.commits`](https://hudi.apache.org/docs/configurations#hoodiecompactinlinemaxdeltacommits)).
+
+
+#### HoodieDeltaStreamer renamed to HoodieStreamer
+Starting from version 0.14.0, we have renamed 
[HoodieDeltaStreamer](https://github.com/apache/hudi/blob/84a80e21b5f0cdc1f4a33957293272431b221aa9/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java)
+to 
[HoodieStreamer](https://github.com/apache/hudi/blob/84a80e21b5f0cdc1f4a33957293272431b221aa9/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java).
 
+We have ensured backward compatibility so that existing user jobs remain 
unaffected. However, in upcoming 
+releases, support for Deltastreamer might be discontinued. Hence, we strongly 
advise users to transition to using 
+HoodieStreamer instead.
+
+
+#### Merge Into Join condition 
+Starting from version 0.14.0, Hudi has the capability to automatically 
generate primary record keys when users do not 
+provide explicit specifications. This enhancement enables the `MERGE INTO 
JOIN` clause to reference any data column for 
+the join condition in Hudi tables where the primary keys are generated by Hudi 
itself. However, in cases where users 
+configure the primary record key, the join condition still expects the primary 
key fields as specified by the user.
+
+
+## Release Highlights
+
+### Record Level Index 
+TODO REVISIT THIS
+Hudi 0.14.0 introduces Record Level Index, a new index implementation which 
significantly enhances write performance. 
+This index stores a mapping of per-record locations and efficiently retrieves 
them during index lookup operations. 
+It can serve as a replacement for other [Global 
indices](https://hudi.apache.org/docs/next/indexing#global-and-non-global-indexes)
 
+such as Global_bloom, Global_Simple, or Hbase used in Hudi. Adopting the 
Record Level Index can potentially boost index 
+lookup performance by 4 to 10 times, depending on the workload, even for 
extremely large-scale datasets (e.g., 1TB). 
+With the Record Level Index, substantial performance improvements can be 
observed for large datasets, as latency is 
+directly proportional to the amount of data being ingested. This is in 
contrast to other Global indices where index 
+lookup time increases linearly with the table size. The Record Level Index is 
designed to efficiently handle lookups 
+for such large-scale data without a linear increase in lookup times as the 
table size grows. To leverage this blazing 
+fast index, users need to enable two configurations:
+- 
[`hoodie.metadata.record.index.enable`](https://hudi.apache.org/docs/next/configurations#hoodiemetadatarecordindexenable)
+  must be enabled to write the Record Level Index to the metadata table.
+- `hoodie.index.type` needs to be set to `RECORD_INDEX` for the index lookup 
to utilize the Record Level Index.
+
+
+### Support for Hudi tables with Autogenerated keys
+Since the initial official version of Hudi, the primary key was a mandatory 
field that users needed to configure for any 
+Hudi table. Starting with 0.14.0, we are relaxing this constraint. This enhancement 
addresses a longstanding need within the 
+community, where certain use-cases didn't naturally possess an intrinsic 
primary key. Version 0.14.0 now offers the 
+flexibility for users to create a Hudi table without the need to explicitly 
configure a primary key (by omitting the 
+configuration setting -
+[`hoodie.datasource.write.recordkey.field`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriterecordkeyfield)).
 
+Hudi will **automatically generate the primary keys** in such cases. This 
feature is applicable only for new tables and 
+cannot be altered for existing ones.
+
+
+(TODO CHECK THIS SUPPORT)

Review Comment:
   I'll need to check whether all these flows were tested. 
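
One flow worth covering in that check is a writer configuration that simply leaves out the record key. The sketch below builds such an option map in plain Python, without Spark. `build_write_options` is a hypothetical helper and the table name is a placeholder; only `hoodie.datasource.write.recordkey.field` is taken from the quoted notes, while `hoodie.table.name` is the standard table-name option.

```python
# Sketch only: assembles writer options; it does not talk to Spark or Hudi.
RECORD_KEY_FIELD = "hoodie.datasource.write.recordkey.field"


def build_write_options(table_name, record_key=None):
    """Return Hudi writer options, omitting the record key when none is given.

    Leaving the record key out is what the quoted release notes describe as
    the way to opt in to automatically generated keys (new tables only).
    """
    options = {"hoodie.table.name": table_name}
    if record_key is not None:
        options[RECORD_KEY_FIELD] = record_key
    return options


# Hypothetical table names, for illustration only.
auto_key_opts = build_write_options("events_auto_key")
explicit_key_opts = build_write_options("events_keyed", record_key="event_id")
```

With PySpark, a dict like this would typically be passed to the writer through `.options(**auto_key_opts)`; whether each write path accepts a missing record key is exactly what needs testing.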



##########
website/releases/release-0.14.0.md:
##########
@@ -0,0 +1,319 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) 
([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new 
functionalities and enhancements. 
+These include the introduction of Record Level Index, automatic generation of 
record keys, the `hudi_table_changes` 
+function for incremental reads, and more. Notably, this release also 
incorporates support for Spark 3.4. On the Flink 
+front, version 0.14.0 brings several exciting features such as consistent 
hashing index support, Flink 1.17 support, and 
+Update and Delete statement support. Additionally, this release upgrades the 
Hudi table version, prompting users to consult
+the Migration Guide provided below. We encourage users to review the release 
highlights (TODO Add link),
+[breaking changes](#migration-guide-breaking-changes), and [behavior 
changes](#migration-guide-behavior-changes) before 
+adopting the 0.14.0 release.
+
+
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans 
from the ".aux" folder and the introduction
+of a new log block version. As part of this release, the table version is 
updated to version `6`. When running a Hudi job 
+with version 0.14.0 on a table with an older table version, an automatic 
upgrade process is triggered to bring the table 
+up to version `6`. This upgrade is a one-time occurrence for each Hudi table, 
as the `hoodie.table.version` is updated in
+the property file upon completion of the upgrade. Additionally, a command-line 
tool for downgrading has been included, 
+allowing users to move from table version `6` to `5`, or revert from Hudi 
0.14.0 to a version prior to 0.14.0. To use this 
+tool, execute it from a 0.14.0 environment (TODO Check what is this tool). For 
more details, refer to the 
+[hudi-cli](#/docs/cli/#upgrade-and-downgrade-table).
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade 
instructions from each older release in
+sequence.
+:::
+
+### Bundle Updates
+
+#### New Spark Bundles
+In this release, we've expanded our support to include bundles for both Spark 
3.4 (hudi-spark3.4-bundle_2.12) (TODO Add 
+link) and Spark 3.0 (hudi-spark3.0-bundle_2.12) (TODO Add link). Please note 
that support for Spark 3.0 had been 
+discontinued after Hudi version 0.10.1, but due to strong community interest, 
it has been reinstated in this release.
+
+### Breaking Changes
+
+#### INSERT INTO behavior with spark-sql
+Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL 
followed the upsert flow, where multiple versions 
+of records would be merged into one version. However, starting from 0.14.0, 
we've altered the default behavior of 
+`INSERT INTO` to utilize the `insert` flow internally. This change 
significantly enhances write performance as it 
+bypasses index lookups.
+
+If a table is created with a *preCombine* key, the default operation for 
`INSERT INTO` remains as `upsert`. Conversely, 
+if no *preCombine* key is set, the underlying write operation for `INSERT 
INTO` defaults to `insert`. Users have the 
+flexibility to override this behavior by explicitly setting values for the 
config 
+[`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation)
 
+as per their requirements. Possible values for this config include `insert`, 
`bulk_insert`, and `upsert`.
+
+Additionally, in version 0.14.0, we have **deprecated** two related older 
configs:
+- `hoodie.sql.insert.mode`
+- `hoodie.sql.bulk.insert.enable`.
+
+### Behavior changes
+
+#### Simplified duplicates handling with Inserts
+In cases where the operation type is configured as `insert`, users now have 
the option to enforce a duplicate policy 
+using the configuration setting 
+[`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy).
 
+This policy determines the action taken when incoming records being ingested 
already exist in storage. The available 
+values for this configuration are as follows:
+
+- `none`: No specific action is taken, allowing duplicates to exist in the 
Hudi table if the incoming records contain duplicates.
+- `drop`: Matching records from the incoming writes will be dropped, and the 
remaining ones will be ingested.
+- `fail`: The write operation will fail if the same records are re-ingested. 
In essence, a given record, as determined 
+by the key generation policy, can only be ingested once into the target table.
+
+With this addition, an older related configuration setting, 
+[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates),
 
+is now deprecated. The newer configuration will take precedence over the old 
one when both are specified. If no specific 
+configurations are provided, the default value for the newer configuration 
will be assumed. Users are strongly encouraged 
+to migrate to the use of these newer configurations. (TODO: Verify this 
behavior)

Review Comment:
   I think the new behavior was well tested, but you can confirm with @jonvex 
or @ad1happy2go 
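
As an aid to confirming the TODO, the precedence rule stated in the quoted paragraph can be written down as a small Spark-free sketch. This models the wording of the release notes, not Hudi's implementation. The helper name `resolve_dup_policy` is hypothetical, and two points are assumptions to verify against the code: that the newer config defaults to `none`, and that the deprecated flag set to `true` behaves like `drop`. The config keys and the three policy values come from the notes.

```python
# Sketch only: mirrors the documented precedence, not Hudi's actual code.
NEW_KEY = "hoodie.datasource.insert.dup.policy"
OLD_KEY = "hoodie.datasource.write.insert.drop.duplicates"  # deprecated
VALID_POLICIES = ("none", "drop", "fail")
ASSUMED_DEFAULT = "none"  # assumption: default value of the newer config


def resolve_dup_policy(options):
    """Return the effective duplicate policy for a dict of writer options."""
    if NEW_KEY in options:
        # The newer config wins whenever it is present.
        policy = str(options[NEW_KEY]).lower()
        if policy not in VALID_POLICIES:
            raise ValueError(f"unsupported value for {NEW_KEY}: {policy!r}")
        return policy
    # Assumption: the deprecated boolean flag maps onto the "drop" policy.
    if str(options.get(OLD_KEY, "false")).lower() == "true":
        return "drop"
    return ASSUMED_DEFAULT
```

The cases to confirm are then easy to enumerate: neither key set, only the old key set, only the new key set, and both set with conflicting values.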


