codope commented on code in PR #9790: URL: https://github.com/apache/hudi/pull/9790#discussion_r1345164763
########## website/releases/release-0.14.0.md: ########## @@ -0,0 +1,339 @@ +--- +title: "Release 0.14.0" +sidebar_position: 1 +layout: releases +toc: true +--- +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +## [Release 0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) ([docs](/docs/quick-start-guide)) +Apache Hudi 0.14.0 marks a significant milestone with a range of new functionalities and enhancements. +These include the introduction of Record Level Index, automatic generation of record keys, the `hudi_table_changes` +function for incremental reads, and more. Notably, this release also incorporates support for Spark 3.4. On the Flink +front, version 0.14.0 brings several exciting features such as consistent hashing index support, Flink 1.17 support, and U +pdate and Delete statement support. Additionally, this release upgrades the Hudi table version, prompting users to consult +the Migration Guide provided below. We encourage users to review the [release highlights](#release-highlights), +[breaking changes](#breaking-changes), and [behavior changes](#behavior-changes) before +adopting the 0.14.0 release. + + + +## Migration Guide +In version 0.14.0, we've made changes such as the removal of compaction plans from the ".aux" folder and the introduction +of a new log block version. As part of this release, the table version is updated to version `6`. When running a Hudi job +with version 0.14.0 on a table with an older table version, an automatic upgrade process is triggered to bring the table +up to version `6`. This upgrade is a one-time occurrence for each Hudi table, as the `hoodie.table.version` is updated in +the property file upon completion of the upgrade. Additionally, a command-line tool for downgrading has been included, +allowing users to move from table version `6` to `5`, or revert from Hudi 0.14.0 to a version prior to 0.14.0. To use this +tool, execute it from a 0.14.0 environment. For more details, refer to the +[hudi-cli](/docs/cli/#upgrade-and-downgrade-table). + +:::caution +If migrating from an older release (pre 0.14.0), please also check the upgrade instructions from each older release in +sequence. +::: + +### Bundle Updates + +#### New Spark Bundles +In this release, we've expanded our support to include bundles for both Spark 3.4 +([hudi-spark3.4-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.4-bundle_2.12)) +and Spark 3.0 ([hudi-spark3.0-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.0-bundle_2.12)). +Please note that, the support for Spark 3.0 had been discontinued after Hudi version 0.10.1, but due to strong community +interest, it has been reinstated in this release. + +### Breaking Changes + +#### INSERT INTO behavior with Spark SQL +Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL followed the upsert flow, where multiple versions +of records would be merged into one version. However, starting from 0.14.0, we've altered the default behavior of +`INSERT INTO` to utilize the `insert` flow internally. This change significantly enhances write performance as it +bypasses index lookups. + +If a table is created with a *preCombine* key, the default operation for `INSERT INTO` remains as `upsert`. Conversely, +if no *preCombine* key is set, the underlying write operation for `INSERT INTO` defaults to `insert`. Users have the +flexibility to override this behavior by explicitly setting values for the config +[`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation) +as per their requirements. Possible values for this config include `insert`, `bulk_insert`, and `upsert`. + +Additionally, in version 0.14.0, we have **deprecated** two related older configs: +- `hoodie.sql.insert.mode` +- `hoodie.sql.bulk.insert.enable`. + +### Behavior changes + +#### Simplified duplicates handling with Inserts in Spark SQL +In cases where the operation type is configured as `insert` for the Spark SQL `INSERT INTO` flow, users now have the +option to enforce a duplicate policy using the configuration setting +[`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy). +This policy determines the action taken when incoming records being ingested already exist in storage. The available +values for this configuration are as follows: + +- `none`: No specific action is taken, allowing duplicates to exist in the Hudi table if the incoming records contain duplicates. +- `drop`: Matching records from the incoming writes will be dropped, and the remaining ones will be ingested. +- `fail`: The write operation will fail if the same records are re-ingested. In essence, a given record, as determined +by the key generation policy, can only be ingested once into the target table. + +With this addition, an older related configuration setting, +[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates), +is now deprecated. The newer configuration will take precedence over the old one when both are specified. If no specific +configurations are provided, the default value for the newer configuration will be assumed. Users are strongly encouraged +to migrate to the use of these newer configurations. + + +#### Compaction with MOR table +For Spark batch writers (both the Spark datasource and Spark SQL), compaction is automatically enabled by default for +MOR (Merge On Read) tables, unless users explicitly override this behavior. Users have the option to disable compaction +explicitly by setting [`hoodie.compact.inline`](https://hudi.apache.org/docs/configurations#hoodiecompactinline) to false. +In case users do not override this configuration, compaction may be triggered for MOR tables approximately once every +5 delta commits (the default value for +[`hoodie.compact.inline.max.delta.commits`](https://hudi.apache.org/docs/configurations#hoodiecompactinlinemaxdeltacommits)). + + +#### HoodieDeltaStreamer renamed to HoodieStreamer +Starting from version 0.14.0, we have renamed [HoodieDeltaStreamer](https://github.com/apache/hudi/blob/84a80e21b5f0cdc1f4a33957293272431b221aa9/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java) +to [HoodieStreamer](https://github.com/apache/hudi/blob/84a80e21b5f0cdc1f4a33957293272431b221aa9/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java). +We have ensured backward compatibility so that existing user jobs remain unaffected. However, in upcoming +releases, support for Deltastreamer might be discontinued. Hence, we strongly advise users to transition to using +HoodieStreamer instead. + + +#### MERGE INTO JOIN condition +Starting from version 0.14.0, Hudi has the capability to automatically generate primary record keys when users do not +provide explicit specifications. This enhancement enables the `MERGE INTO JOIN` clause to reference any data column for +the join condition in Hudi tables where the primary keys are generated by Hudi itself. However, in cases where users +configure the primary record key, the join condition still expects the primary key fields as specified by the user. + + +## Release Highlights + +### Record Level Index +Hudi version 0.14.0, introduces a new index implementation - +[Record Level Index](https://github.com/apache/hudi/blob/master/rfc/rfc-8/rfc-8.md#rfc-8-metadata-based-record-index). +The Record level Index significantly enhances write performance for large tables by efficiently storing per-record +locations and enabling swift retrieval during index lookup operations. It can effectively replace other +[Global indices](https://hudi.apache.org/docs/next/indexing#global-and-non-global-indexes) like Global_bloom, +Global_Simple, or Hbase, commonly used in Hudi. + +Bloom and Simple Indexes exhibit slower performance for large datasets due to the high costs associated with gathering +index data from various data files during lookup. Moreover, these indexes do not preserve a one-to-one record-key to +record file path mapping; instead, they deduce the mapping through an optimized search at lookup time. The per-file +overhead required by these indexes makes them less effective for datasets with a larger number of files or records. + +On the other hand, the Hbase Index saves a one-to-one mapping for each record key, resulting in fast performance that +scales with the dataset size. However, it necessitates a separate HBase cluster for maintenance, which is operationally +challenging and resource-intensive, requiring specialized expertise. + +The Record Index combines the speed and scalability of the HBase Index without its limitations and overhead. Being a +part of the HUDI Metadata Table, any future performance enhancements in writes and queries will automatically translate +into improved performance for the Record Index. Adopting the Record Level Index has the potential to boost index lookup +performance by 4 to 10 times, depending on the workload, even for extremely large-scale datasets (e.g., 1TB). + +With the Record Level Index, significant performance improvements can be observed for large datasets, as latency is +directly proportional to the amount of data being ingested. This is in contrast to other Global indices where index +lookup time increases linearly with the table size. The Record Level Index is specifically designed to efficiently +handle lookups for such large-scale data without a linear increase in lookup times as the table size grows. + +To harness the benefits of this lightning-fast index, users need to enable two configurations: +- [`hoodie.metadata.record.index.enable`](https://hudi.apache.org/docs/next/configurations#hoodiemetadatarecordindexenable) must be enabled to write the Record Level Index to the metadata table. +- `hoodie.index.type` needs to be set to `RECORD_INDEX` for the index lookup to utilize the Record Level Index. + + +### Support for Hudi tables with Autogenerated keys +Since the initial official version of Hudi, the primary key was a mandatory field that users needed to configure for any +Hudi table. Starting 0.14.0, we are relaxing this constraint. This enhancement addresses a longstanding need within the +community, where certain use-cases didn't naturally possess an intrinsic primary key. Version 0.14.0 now offers the +flexibility for users to create a Hudi table without the need to explicitly configure a primary key (by omitting the +configuration setting - +[`hoodie.datasource.write.recordkey.field`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriterecordkeyfield)). +Hudi will **automatically generate the primary keys** in such cases. This feature is applicable only for new tables and +cannot be altered for existing ones. + + +This functionality is available in all spark writers with certain limitations. For append only type of use cases, Inserts and +bulk_inserts are allowed with all four writers - Spark Datasource, Spark SQL, Spark Streaming, Hoodie Streamer. Updates and +Deletes are supported only using spark-sql `MERGE INTO` , `UPDATE` and `DELETE` statements. With Spark Datasource, `UPDATE` +and `DELETE` are supported only when the source dataframe contains Hudi's meta fields. Please check out our +[quick start guide](https://hudi.apache.org/docs/quick-start-guide) for code snippets on Hudi table CRUD operations where +keys are autogenerated. + +### Spark 3.4 version support + +Spark 3.4 support is added; users who are on Spark 3.4 can use +[hudi-spark3.4-bundle](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.4-bundle). Spark 3.2, Spark 3.1, +Spark3.0 and Spark 2.4 will continue to be supported. Please check the migration guide for bundle updates. To quickly get +started with Hudi and Spark 3.4, you can explore our [quick start guide](https://hudi.apache.org/docs/quick-start-guide). + +### Query side improvements: + +#### Metadata table support with Athena + +Users now have the ability to utilize Hudi’s [Metadata table](https://hudi.apache.org/docs/metadata/) seamlessly with Athena. +The file listing index removes the need for recursive file system calls like "list files" by retrieving information +from an index that maintains a mapping of partitions to files. This approach proves to be highly efficient, particularly +when dealing with extensive datasets. With Hudi 0.14.0, users can activate file listing based on the metadata table when +performing Glue catalog synchronization for their Hudi tables. To enable this functionality, users can configure +`hoodie.datasource.meta.sync.glue.metadata_file_listing` and set it to true during the Glue sync process. + +#### Leverage Parquet bloom filters w/ read queries +In Hudi 0.14.0, users can now utilize the native +[Parquet bloom filters](https://github.com/apache/parquet-format/blob/1603152f8991809e8ad29659dffa224b4284f31b/BloomFilter.md), +provided their compute engine supports Apache Parquet 1.12.0 or higher. This support covers both the writing and reading +of datasets. Hudi facilitates the use of native Parquet bloom filters through Hadoop configuration. Users are required +to set a Hadoop configuration with a specific key representing the column for which the bloom filter is to be applied. +For example, `parquet.bloom.filter.enabled#rider=true` creates a bloom filter for the rider column. Whenever a query +involves a predicate on the rider column, the bloom filter comes into play, enhancing read performance. + +#### Incremental queries with multi-writers +In multi-writer scenarios, there can be instances of gaps in the timeline (requests or inflight instants that are not +the latest instant) due to concurrent writing activities. These gaps may result in inconsistent outcomes when +performing incremental queries. To address this issue, Hudi 0.14.0 introduces a new configuration setting, +[`hoodie.read.timeline.holes.resolution.policy`](https://hudi.apache.org/docs/configurations#hoodiereadtimelineholesresolutionpolicy), +specifically designed for handling these inconsistencies in incremental queries. The configuration provides three possible policies: +- `FAIL`: This serves as the default policy and throws an exception when such timeline gaps are identified during an incremental query. +- `BLOCK`: In this policy, the results of an incremental query are limited to the time range between the holes in the + timeline. For instance, if a gap is detected at instant t1 within the incremental query range from t0 to t2, the + query will only display results between t0 and t1 without failing. +- `USE_TRANSITION_TIME`: This policy is experimental and involves using the state transition time, which is based on the + file modification time of commit metadata files in the timeline, during the incremental query. + +#### Timestamp support with Hive 3.x +For quite some time, Hudi users encountered [challenges](https://issues.apache.org/jira/browse/HUDI-83) regarding reading Timestamp type columns written by Spark and +subsequently attempting to read them with Hive 3.x. While in Hudi 0.13.x, we introduced a +[workaround](https://github.com/apache/hudi/commit/cd314b8cfa58c32f731f7da2aa6377a09df4c6f9#diff-cff4dfc264f7abcac63a5ba5db55b38115177fe279ab35807d345c2b8872475e) +to mitigate this issue, version 0.14.0 now ensures full compatibility of HiveAvroSerializer with Hive 3.x to resolve this. + +#### Google BigQuery sync enhancements +With 0.14.0, the [BigQuerySyncTool](https://hudi.apache.org/docs/gcp_bigquery) supports syncing table to BigQuery +using [manifests](https://cloud.google.com/blog/products/data-analytics/bigquery-manifest-file-support-for-open-table-format-queries). +This is expected to have better query performance compared to legacy way. Schema evolution is supported with the manifest approach. +Partition column no longer needs to be dropped from the files due to new schema handling improvements. To enable this +feature, users can set +[`hoodie.gcp.bigquery.sync.use_bq_manifest_file`](https://hudi.apache.org/docs/configurations#hoodiegcpbigquerysyncuse_bq_manifest_file) +to true. + +### Spark read side improvements + Review Comment: In the spark query side improvement, should we talk about RLI benefits for keyed lookups and deleted? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
