prashantwason opened a new issue, #18059: URL: https://github.com/apache/hudi/issues/18059
## Motivation

When using bulk insert with very large datasets (e.g., billions of records, many TB of data), the existing sort modes can cause OOM issues:

- **GLOBAL_SORT**: Requires a global shuffle that can cause OOM when sorting large datasets with many files (e.g., 2800+ files, each ~1GB)
- **PARTITION_SORT**: Provides suboptimal file sizes and doesn't help with record key ordering across partitions
- **NONE**: Fast, but provides no data ordering benefits for query predicate pushdown

Users with 5TB+ datasets and 60+ billion records (see #12116) have reported OOM issues with all existing sort modes.

## Proposed Solution

Add a new `RANGE_BUCKET` sort mode that:

1. **Uses quantile-based bucketing**: Leverages Spark's `approxQuantile()` algorithm to compute range boundaries with minimal memory overhead
2. **Assigns records to buckets**: Records are assigned to buckets based on value ranges of a specified column (defaulting to the record key)
3. **Handles skew**: Automatically detects skewed partitions and redistributes them to sub-partitions
4. **Optional within-partition sorting**: Can sort records within each bucket for better compression

### Benefits

- **Memory efficient**: Avoids the expensive global shuffle operations that cause OOM
- **Better query performance**: Tight value ranges in parquet footers enable effective predicate pushdown
- **Better compression**: Adjacent records (in sorted order) are colocated, enabling ~30% better compression with dictionary/RLE encoding
- **Configurable**: Supports custom partitioning columns, skew detection thresholds, and within-partition sorting

### Proposed Configuration Options

| Config | Description |
|--------|-------------|
| `hoodie.bulkinsert.sort.mode=RANGE_BUCKET` | Enable range bucket mode |
| `hoodie.bulkinsert.range.bucket.partitioning.column` | Column to use for bucketing (defaults to record key) |
| `hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition` | Enable within-partition sorting |
| `hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier` | Threshold for skew detection |
| `hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy` | Accuracy for quantile computation |

## Use Case

This is particularly useful for:

- Initial bulk loading of very large datasets
- Tables with high-cardinality primary keys (UUIDs, etc.)
- Workloads where query patterns benefit from record key ordering
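As a rough illustration of steps 1 and 2 of the proposal (compute quantile boundaries, then assign records to range buckets), here is a dependency-free Python sketch. In Spark the boundaries would come from `DataFrame.approxQuantile()`; the helper names and the exact-quantile-of-a-sample approach below are illustrative, not from the proposal:

```python
import bisect
import random

def compute_boundaries(sample, num_buckets):
    """Approximate range boundaries from a sample of the bucketing column.

    Stand-in for Spark's DataFrame.approxQuantile(col, probabilities,
    relativeError): takes exact quantiles of a small sample so the sketch
    stays dependency-free.
    """
    ordered = sorted(sample)
    # num_buckets - 1 interior cut points at the 1/n, 2/n, ... quantiles.
    return [ordered[(i * len(ordered)) // num_buckets]
            for i in range(1, num_buckets)]

def bucket_of(key, boundaries):
    """Assign a key to a range bucket via binary search over the cut points."""
    return bisect.bisect_right(boundaries, key)

# Usage: 100k uniform keys (stand-in for high-cardinality record keys)
# split into 8 range buckets using boundaries from a 10k-row sample.
random.seed(42)
keys = [random.random() for _ in range(100_000)]
boundaries = compute_boundaries(random.sample(keys, 10_000), 8)

counts = [0] * 8
for k in keys:
    counts[bucket_of(k, boundaries)] += 1
# Because the boundaries are quantiles, the buckets come out roughly
# balanced without ever shuffling or sorting the full dataset.
```

Only the sample is sorted, which is why the memory footprint stays small relative to a global sort.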
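Step 3 (skew handling) could work along these lines: compare each bucket's record count against the mean, and split any bucket exceeding the `skew.multiplier` threshold into sub-partitions. This is a hedged sketch of one possible policy; the function name, return shape, and routing scheme are made up for illustration:

```python
import math

def split_skewed_buckets(bucket_counts, skew_multiplier=2.0):
    """Map each bucket index to a number of sub-partitions.

    A bucket whose record count exceeds skew_multiplier times the mean
    count is split into enough sub-partitions to bring each sub-partition
    back near the mean (mirroring the proposed
    hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier config).
    """
    mean = sum(bucket_counts) / len(bucket_counts)
    plan = {}
    for bucket, count in enumerate(bucket_counts):
        if count > skew_multiplier * mean:
            plan[bucket] = math.ceil(count / mean)
        else:
            plan[bucket] = 1
    return plan

# Usage: bucket 2 holds far more records than its peers.
counts = [100, 110, 900, 95, 105]
plan = split_skewed_buckets(counts, skew_multiplier=2.0)
# Records in a split bucket could then be routed with
# (bucket, hash(record_key) % plan[bucket]) as the shuffle key,
# so each sub-partition produces a reasonably sized file.
```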
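If the proposal lands, enabling the mode from a PySpark writer might look like the following. This is hypothetical, since `RANGE_BUCKET` does not exist yet; `df`, `base_path`, and the `uuid` bucketing column are placeholders, while the surrounding operation/sort-mode options are existing Hudi bulk-insert configs:

```python
# Hypothetical usage: RANGE_BUCKET is only proposed, not released.
(df.write.format("hudi")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.bulkinsert.sort.mode", "RANGE_BUCKET")
    .option("hoodie.bulkinsert.range.bucket.partitioning.column", "uuid")
    .option("hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition", "true")
    .mode("append")
    .save(base_path))
```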
