linliu-code commented on code in PR #14300:
URL: https://github.com/apache/hudi/pull/14300#discussion_r2539959003
##########
website/releases/release-1.1.0.md:
##########
@@ -0,0 +1,366 @@
+---
+title: "Release 1.1.0"
+layout: releases
+toc: true
+---
+
+## [Release 1.1.0](https://github.com/apache/hudi/releases/tag/release-1.1.0)
+
+Apache Hudi 1.1.0 is a major release that brings significant performance improvements, new features, and important changes to the platform. This release focuses on enhanced table format support, improved indexing capabilities, expanded engine support, and modernized record merging APIs.
+
+## Highlights
+
+- **Pluggable Table Format Framework** - Native integration of multiple table formats with unified metadata management
+- **Spark 4.0 and Flink 2.0 Support** - Full support for latest major compute engine versions
+- **Enhanced Indexing** - Non-global Record Index, partition-level bucket index, native HFile writer, and Column Stats V2
+- **Table Services Optimization** - Parquet file stitching and incremental scheduling for compaction/clustering
+- **Storage-based Lock Provider** - Multi-writer concurrency control without external dependencies
+- **Record Merging Evolution** - Deprecation of payload classes in favor of merge modes and merger APIs
+
+---
+
+## New Features
+
+### Table Format
+
+#### Pluggable Table Format Support
+
+Hudi 1.1.0 introduces a new Pluggable Table Format framework that enables native integration of multiple table formats within the system. This foundation includes a base interface for pluggable table formats, designed to simplify extension and allow seamless interoperability across different storage backends. The Metadata Table (MDT) integration has been enhanced to support pluggability, ensuring modularity and unified metadata management across all supported table formats.
+
+This release brings native Apache Hudi integration through the new framework, allowing users to leverage Hudi's advanced capabilities directly while maintaining consistent semantics and performance.
The configuration `hoodie.table.format` is set to `native` by default, which works as the Hudi table format. **No configuration changes are required** for existing and new Hudi tables. As additional table formats are supported in future releases, users will be able to set this configuration to work natively with other formats.
+
+#### Table Version 9 with Index Versioning
+
+Hudi 1.1 introduces table version 9 with support for index versioning. Indexes in the Metadata Table (column stats, secondary index, expression index, etc) now have version tracking. In 1.1, these indexes use V2 layouts with enhanced capabilities including comprehensive logical data type support. Tables migrated from older versions will retain V1 index layouts, while new tables created with 1.1 use V2. Both versions remain backward compatible, and no action is required when upgrading to 1.1.
+
+### Indexing
+
+#### Non-Global Record Index
+
+In addition to the global Record Index introduced in 0.14.0, Hudi 1.1 adds a non-global variant that guarantees uniqueness for partition path and record key pairs. This index speeds up lookups in very large partitioned datasets.
+
+**Prior to 1.1**, only global Record Index was available, configured as:
+
+```properties
+hoodie.metadata.record.index.enable=true
+hoodie.index.type=RECORD_INDEX
+```
+
+**From 1.1 onwards**, both global and non-global variants are available:
+
+For non-global Record Index:
+
+- Metadata table: `hoodie.metadata.record.level.index.enable=true`
+- Write index: `hoodie.index.type=RECORD_LEVEL_INDEX`
+
+For global Record Index:
+
+- Metadata table: `hoodie.metadata.global.record.level.index.enable=true`
+- Write index: `hoodie.index.type=GLOBAL_RECORD_LEVEL_INDEX`
+
+#### Partition-Level Bucket Index
+
+A new bucket index type that addresses bucket rescaling challenges. Users can set specific bucket numbers for different partitions through a rule engine (regex pattern matching).
Existing Bucket Index tables can be upgraded smoothly and seamlessly.
+
+**Key Configurations**:
+
+- `hoodie.bucket.index.partition.rule.type` - Rule parser for expressions (default: regex)
+- `hoodie.bucket.index.partition.expressions` - Expression and bucket number pairs
+- `hoodie.bucket.index.num.buckets` - Default bucket count for partitions
+
+For more details, see [RFC-89](https://github.com/apache/hudi/pull/12884/files)
+
+#### Native HFile Writer
+
+Hudi now includes a native HFile writer, eliminating dependencies on HBase while ensuring compatibility with both HBase's HFile reader and Hudi's native reader. This significantly reduces the size of Hudi's binary bundle and enhances Hudi's ability to optimize HFile performance.
+
+#### HFile Performance Enhancements
+
+Multiple enhancements to speed up metadata table reads:
+
+- **HFile Block Cache** (enabled by default): Caches HFile data blocks on repeated reads within the same JVM, showing ~4x speedup in benchmarks. Configure with `hoodie.hfile.block.cache.enabled`
+- **HFile Prefetching**: For files under 50MB (configurable via `hoodie.metadata.file.cache.max.size.mb`), downloads entire HFile upfront rather than multiple RPC calls
+- **Bloom Filter Support**: Speeds up HFile lookups by avoiding unnecessary block downloads. Configure with `hoodie.metadata.bloom.filter.enable`
+
+#### Column Stats V2 with Enhanced Data Type Support
+
+Column Stats V2 significantly improves support for logical data types during writes and statistics collection. Logical types like decimals (with precision/scale) and timestamps are now preserved with proper metadata, improving accuracy for query planning and predicate pushdown.
+
+### Table Services
+
+#### Parquet File Stitching
+
+An optimization that enables direct copying of RowGroup-level data from Parquet files during operations like clustering, bypassing expensive compression/decompression, encoding/decoding, and column-to-row conversions.
This optimization supports proper schema evolution and ensures Hudi metadata are collected and aggregated correctly. Experimental results show a **95% reduction in computational workload** for clustering operations.
+
+#### Incremental Table Service Scheduling
+
+Significantly improves performance of compaction and clustering operations on tables with large numbers of partitions. Enabled by default via `hoodie.table.services.incremental.enabled`.
+
+### Concurrency Control
+
+#### Storage-based Lock Provider
+
+A new storage-based lock provider enables Hudi to manage multi-writer concurrency directly using the `.hoodie` directory in the underlying storage, eliminating the need for external lock providers like DynamoDB or ZooKeeper. Currently supports S3 and GCS, with lock information maintained under `.hoodie/.lock`.
+
+### Writers & Readers
+
+#### Multiple Ordering Fields
+
+Support for multiple ordering (pre-combine) fields using comma-separated lists. When records have the same key, Hudi compares the fields in order and keeps the record with the latest values.
+
+**Configuration**: `hoodie.table.ordering.fields = field1,field2,field3`
+
+#### Efficient Streaming Reads for Data Blocks
+
+Support for efficient streaming reads of HoodieDataBlocks (currently for AvroDataBlock) reduces memory usage, improves read stability on HDFS, and lowers the risk of timeouts and OOM errors when reading log files.
+
+#### ORC Support in FileGroupReader
+
+Enhanced support for multiple base file formats (ORC and Parquet) in HoodieFileGroupReader. The 1.1.0 release introduced SparkColumnarFileReader trait and MultipleColumnarFileFormatReader to uniformly handle ORC and Parquet records for both Merge-on-Read (MOR) and Copy-on-Write (COW) tables.
+
+#### Hive Schema Evolution Support
+
+Hive readers can now handle schema evolution when schema-on-write is used.
+
+### Spark
+
+#### Spark 4.0 Support
+
+Apache Spark 4.0 is now supported with necessary compatibility and dependency changes. Available through the new `hudi-spark4.0-bundle_2.13` release artifact.
+
+#### Metadata Table Streaming Writes
+
+Streaming writes to the metadata table enable more efficient metadata record generation by processing data table and metadata table writes in the same execution chain, avoiding on-demand lookups. Enabled by default for Spark via `hoodie.metadata.streaming.write.enabled`.
+
+#### SQL Procedures Enhancements
+
+**New CLEAN Procedures**:
+
+- `show_cleans` - Displays completed cleaning operations with metadata
+- `show_clean_plans` - Shows clean operations in all states (REQUESTED, INFLIGHT, COMPLETED)
+- `show_cleans_metadata` - Provides partition-level cleaning details
+
+**Enhanced Capabilities**:
+
+- Regex pattern support in `run_clustering` via `partition_regex_pattern` parameter
+- Base path and filter parameters for all non-action `SHOW` procedures with advanced predicate expressions
+
+### Flink
+
+#### Flink 2.0 Support
+
+Full support for Flink 2.0 including sink, read, catalog, and new bundle artifact `hudi-flink2.0-bundle`. Includes compatibility fixes for legacy APIs and supports sinkV2 API by default.
+
+**Deprecation**: Removed support for Flink 1.14, 1.15, and 1.16
+
+#### Performance Improvements
+
+- **Engine-native Record Support**: Eliminates Avro transformations, utilizing RowData directly for more efficient serialization/deserialization.
Write/read performance improved by **2-3x on average**
+- **Async Instant Time Generation**: Significantly improves stability for high throughput workloads by avoiding blocking on instant time generation
+- **Meta Fields Control**: Support for `hoodie.populate.meta.fields` in append mode, showing 14% faster writes when disabled
+
+#### New Capabilities
+
+- **In-memory Buffer Sort**: For pk-less tables, enables better compaction ratio for columnar formats (`write.buffer.sort.enabled`)
+- **Split-level Rate Limiting**: Configure maximum splits per instant check for streaming reads (`read.splits.limit`)
+
+### Catalogs
+
+#### Apache Polaris Integration
+
+Integration with Apache Polaris catalog by delegating table creation to the Polaris Spark client, allowing Hudi tables to be registered in the Polaris Catalog.
+
+**Configuration**: `hoodie.datasource.polaris.catalog.class` (default: `org.apache.polaris.spark.SparkCatalog`)
+
+#### AWS Glue & DataHub Sync Enhancements
+
+- CatalogId support for cross-catalog scenarios
+- Explicit database and table name configuration
+- Resource tagging for Glue databases and tables
+- TLS/HTTPS support for DataHub with custom CA certificates and mutual-TLS
+
+### Platform Components
+
+#### Enhanced JSON-to-Avro Conversion for HudiStreamer
+
+Improved JSON-to-Avro conversion layer for better reliability of the Kafka JSON source.
+
+#### Prometheus Multi-table Support
+
+Improved PrometheusReporter with reference-counting mechanism to prevent shared HTTP server shutdown when stopping metrics for individual tables.
+
+---
+
+## API Changes & Deprecations
+
+### Deprecation of HoodieRecordPayload
+
+Payload classes are now **deprecated** in favor of merge modes and the merger API. The payload-based approach was closely tied to Avro-formatted records, making it less compatible with native query engine formats like Spark InternalRow.
+
+**Migration Path**:
+
+- **Standard Use Cases**: Use merge mode configurations (`COMMIT_TIME_ORDERING` or `EVENT_TIME_ORDERING`)
+- **Custom Logic**: Implement `HoodieRecordMerger` interface instead of custom payloads
+- **Automatic Migration**: When upgrading to the latest table version, known payloads are automatically migrated to appropriate merge modes
+
+The merge mode approach supports:
+
+- Commit time and event time ordering
+- Partial update strategies (replaces `OverwriteNonDefaultsWithLatestAvroPayload` and PostgresDebeziumAvroPayload toasted value handling)

Review Comment:
PostgresDebeziumAvroPayload -> `PostgresDebeziumAvroPayload`
