prashantwason opened a new issue, #18059:
URL: https://github.com/apache/hudi/issues/18059

   ## Motivation
   
   When using bulk insert with very large datasets (e.g., billions of records, 
many TB of data), the existing sort modes can cause OOM issues:
   
   - **GLOBAL_SORT**: Requires a global shuffle that can cause OOM when sorting 
large datasets with many files (e.g., 2800+ files each ~1GB)
   - **PARTITION_SORT**: Provides suboptimal file sizes and doesn't help with 
record key ordering across partitions
   - **NONE**: Fast but provides no data ordering benefits for query predicate 
pushdown
   
   Users with 5TB+ datasets and 60+ billion records (see #12116) have reported 
OOM issues with all existing sort modes.
   
   ## Proposed Solution
   
   Add a new `RANGE_BUCKET` sort mode that:
   
   1. **Uses quantile-based bucketing**: Leverages Spark's `approxQuantile()` 
algorithm to compute range boundaries with minimal memory overhead
   2. **Assigns records to buckets**: Records are assigned to buckets based on the value ranges of a specified column (defaulting to the record key)
   3. **Handles skew**: Automatically detects skewed buckets and redistributes their records across sub-partitions
   4. **Optional within-partition sorting**: Can sort records within each 
bucket for better compression
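
   As a rough illustration of steps 1-2, the idea can be sketched in plain Python: compute approximate quantile cut points over the partitioning column, then binary-search each record's value against them to get a bucket id. The actual implementation would call Spark's `DataFrame.approxQuantile()` on the distributed dataset; `statistics.quantiles` here is a local stand-in, and the function names are illustrative, not part of the proposal.

```python
import bisect
import random
import statistics

def compute_bucket_boundaries(values, num_buckets):
    """Quantile cut points dividing `values` into num_buckets ranges.

    Local stand-in for Spark's DataFrame.approxQuantile(), which computes
    the same boundaries over a distributed dataset with bounded memory.
    """
    # quantiles(n=k) returns k-1 cut points splitting the data into k groups
    return statistics.quantiles(values, n=num_buckets)

def assign_bucket(value, boundaries):
    """Bucket index for a value: binary search over the sorted cut points."""
    return bisect.bisect_right(boundaries, value)

random.seed(42)
keys = [random.random() for _ in range(100_000)]
boundaries = compute_bucket_boundaries(keys, num_buckets=8)

counts = [0] * 8
for k in keys:
    counts[assign_bucket(k, boundaries)] += 1

# Quantile-based boundaries yield roughly equal-sized buckets (~12,500 each here),
# which is what gives evenly sized output files without a global sort.
print(counts)
```

   Because the boundaries come from quantiles of the data itself, each bucket receives about the same number of records regardless of the key distribution, which is exactly the file-sizing property a fixed-width range partitioner would lose on skewed keys.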
   
   ### Benefits
   
   - **Memory efficient**: Avoids expensive global shuffle operations that 
cause OOM
   - **Better query performance**: Tight min/max value ranges in Parquet footers enable effective predicate pushdown
   - **Better compression**: Adjacent records (in sorted order) are colocated, 
enabling ~30% better compression with dictionary/RLE encoding
   - **Configurable**: Supports custom partitioning columns, skew detection 
thresholds, and within-partition sorting
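
   The skew-handling step could work along the following lines: after counting records per bucket, any bucket exceeding `skew.multiplier` times the average count is split into enough sub-partitions to bring each back near the average. This is only a sketch of one plausible policy, under assumptions of mine (the ceil-division splitting rule, the function name), not the proposed implementation:

```python
def split_skewed_buckets(bucket_counts, skew_multiplier=2.0):
    """Map each bucket index to a number of sub-partitions.

    A bucket whose record count exceeds skew_multiplier * average is split
    so that each sub-partition holds roughly the average count.
    """
    avg = sum(bucket_counts) / len(bucket_counts)
    plan = {}
    for bucket, count in enumerate(bucket_counts):
        if count > skew_multiplier * avg:
            # ceil division: enough sub-partitions to dilute the hot bucket
            plan[bucket] = -(-count // int(avg))
        else:
            plan[bucket] = 1
    return plan

counts = [1000, 1000, 9000, 1000, 1000]  # bucket 2 is heavily skewed
print(split_skewed_buckets(counts, skew_multiplier=2.0))
# → {0: 1, 1: 1, 2: 4, 3: 1, 4: 1}
```

   Splitting only the hot buckets keeps the shuffle cheap for well-distributed keys while capping the size of any single write task.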
   
   ### Proposed Configuration Options
   
   | Config | Description |
   |--------|-------------|
   | `hoodie.bulkinsert.sort.mode=RANGE_BUCKET` | Enable range bucket mode |
   | `hoodie.bulkinsert.range.bucket.partitioning.column` | Column to use for 
bucketing (defaults to record key) |
   | `hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition` | 
Enable within-partition sorting |
   | `hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier` | Threshold 
for skew detection |
   | `hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy` | Accuracy 
for quantile computation |
   
   ## Use Case
   
   This is particularly useful for:
   - Initial bulk loading of very large datasets
   - Tables with high-cardinality primary keys (UUIDs, etc.)
   - Workloads where query patterns benefit from record key ordering

