nsivabalan opened a new pull request, #18353:
URL: https://github.com/apache/hudi/pull/18353

   ### Describe the issue this Pull Request addresses
   
   For the Record Level Index (RLI), the default of 10 or 1 file groups might 
not match the actual table size. The metadata table and RLI are initialized 
**before the first commit completes**, so Hudi cannot assess the actual table 
size to determine an appropriate number of file groups. This results in:
   
     **Large bootstrap scenarios**: The hardcoded default may be too small for 
tables initialized with large data loads
   
   ### Summary and Changelog
   
    **What users gain:**
     - **Optimized RLI file group allocation**: For fresh tables, RLI 
initialization is now deferred until after the first commit, allowing
     Hudi to programmatically determine the optimal number of file groups based 
on actual data size
     - **Better resource utilization**: Small tables will use fewer file groups 
(as few as 1), while large bootstrap scenarios will allocate
     more file groups, as appropriate, within the configured min/max bounds
     - **Transparent optimization**: No user configuration changes needed - the 
deferral happens automatically
   
     **Detailed Changes:**
     - **HoodieBackedTableMetadataWriter.java (lines 454-456)**: Added logic to 
defer RLI initialization for fresh tables (tables with zero
     completed instants) by removing `RECORD_INDEX` from 
`enabledPartitionTypes` during the first commit
     - **TestHoodieBackedMetadata.java**: Added two comprehensive tests:
       - `testPartitionedRecordIndexDeferredInitializationForFreshTable`: 
Validates RLI is NOT initialized on 1st commit but IS initialized
     on 2nd commit with file group count = 1 for small data (150 records)
       - `testPartitionedRecordIndexLargerDataFileGroupCount`: Validates that 
with larger data (7000 records), file group count is
     programmatically determined (> 1) based on the `estimateFileGroupCount` 
logic
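
   The deferral described for `HoodieBackedTableMetadataWriter.java` can be 
pictured with a minimal sketch. It is not the code from this PR: the 
`PartitionType` enum, the `DeferredRliSketch` class, and the 
`partitionsToInitialize` method are hypothetical stand-ins, and only the rule 
itself (drop `RECORD_INDEX` while the table has zero completed instants) comes 
from the description above.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only; names here are assumptions, not Hudi APIs.
final class DeferredRliSketch {

    // Hypothetical stand-in for the metadata partition types Hudi tracks.
    enum PartitionType { FILES, COLUMN_STATS, RECORD_INDEX }

    // Returns the partitions to initialize on the current commit.
    // A fresh table (zero completed instants) skips RECORD_INDEX so that
    // it can be sized later, once real data has been written.
    static Set<PartitionType> partitionsToInitialize(Set<PartitionType> enabled,
                                                     int completedInstants) {
        EnumSet<PartitionType> result = EnumSet.noneOf(PartitionType.class);
        result.addAll(enabled);
        if (completedInstants == 0) {
            result.remove(PartitionType.RECORD_INDEX);
        }
        return result;
    }
}
```

   Under this sketch, a call with `completedInstants == 0` returns every enabled 
partition except `RECORD_INDEX`, and any later call returns the enabled set 
unchanged.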
   
     **How it works:**
     1. On **first commit** (fresh table): RLI initialization is skipped even 
if enabled in config
     2. On **second commit**: RLI initialization proceeds normally, and 
`estimateFileGroupCount()` uses the actual record count from the
     first commit to determine file groups within the configured min/max bounds
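
   As a rough illustration of step 2, a size-based estimate clamped to min/max 
bounds could look like the sketch below. This is an assumption about the 
general shape of such a calculation, not the actual `estimateFileGroupCount` 
implementation: the class name, the parameter names, and the plain 
"total bytes divided by target file group size" formula are all hypothetical.

```java
// Illustrative sketch only; the formula and names are assumptions.
final class FileGroupEstimateSketch {

    // Estimates a file group count from the observed record count and
    // clamps the result to [minFileGroups, maxFileGroups].
    static int estimateFileGroups(long recordCount,
                                  long avgRecordSizeBytes,
                                  long maxFileGroupSizeBytes,
                                  int minFileGroups,
                                  int maxFileGroups) {
        if (maxFileGroupSizeBytes <= 0 || minFileGroups > maxFileGroups) {
            throw new IllegalArgumentException("invalid sizing bounds");
        }
        if (recordCount <= 0) {
            return minFileGroups;
        }
        // multiplyExact fails loudly on overflow instead of wrapping.
        long totalBytes = Math.multiplyExact(recordCount, avgRecordSizeBytes);
        // Ceiling division: number of file groups needed to hold totalBytes.
        long needed = (totalBytes + maxFileGroupSizeBytes - 1) / maxFileGroupSizeBytes;
        return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
    }
}
```

   With these hypothetical inputs, a small first commit stays at the minimum 
while a large one is capped at the maximum, which matches the behavior the two 
new tests check for (1 file group for small data, more than 1 for larger data).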
   
   ### Impact
   
    **User-facing changes:**
     - **Behavioral change (non-breaking)**: For new tables with RLI enabled, 
the RLI partition will not be available after the first commit,
      but will be available starting from the second commit
     - **Performance improvement**: Reduced overhead for small tables, better 
scaling for large bootstrap scenarios
     - **No config changes needed**: Existing configurations continue to work; 
the optimization is automatic
   
     **Performance impact:**
     - Small tables (< 1000 records): Expected reduction from 10 file groups to 
1-2 file groups, reducing metadata table overhead
     - Large bootstrap tables (> 100K records): Better distribution across more 
file groups within max bounds
   
   
   ### Risk Level
   
   low
   
   ### Documentation Update
   
   <!-- Describe any necessary documentation update if there is any new 
feature, config, or user-facing change. If not, put "none".
   
   - The config description must be updated if new configs are added or the 
default value of the configs are changed.
   - Any new feature or user-facing change requires updating the Hudi website. 
Please follow the 
     [instruction](https://hudi.apache.org/contribute/developer-setup#website) 
to make changes to the website. -->
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
