nsivabalan opened a new pull request, #18865:
URL: https://github.com/apache/hudi/pull/18865

   ### Describe the issue this Pull Request addresses
   
   Follow-up to #18353, which added 
`hoodie.metadata.record.level.index.defer.init` to defer Record Level Index 
(RLI) bootstrap to the 2nd commit on a fresh table. The original change relied 
on `dataWriteConfig.getWriteSchema()` to resolve the read/data schema when 
initializing the RLI partition. That schema is not always populated:
   
   - When the metadata writer is constructed on the 2nd commit (after a 
deferred first commit), the write config used to build the metadata writer may 
not carry the avro schema string.
   - When the metadata writer is constructed outside an active write (e.g. via 
`metadataWriter(writeConfig)` for reads), the same gap exists.
   
   In those cases `HoodieSchema.parse(dataWriteConfig.getWriteSchema())` fails, 
which blocks RLI initialization on commit #2. The deferred path with 
`bulk_insert` hit this bug.
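
For orientation, the scenario that hit the bug can be written as a pair of write options. This is only a sketch: the defer key comes from this description, `hoodie.datasource.write.operation` is the usual Hudi datasource option for choosing the write operation, and the value `"true"` assumes the defer flag is a boolean. The class and method names are hypothetical. Options that enable RLI itself, the table name, and the key fields are deliberately left out.

```java
import java.util.Properties;

public class DeferredRliWriteOptions {
  // Key named in this PR description (introduced by #18353).
  public static final String DEFER_RLI_INIT_KEY =
      "hoodie.metadata.record.level.index.defer.init";
  // Hudi datasource option that selects the write operation.
  public static final String WRITE_OPERATION_KEY =
      "hoodie.datasource.write.operation";

  /** Builds the two options that describe a deferred-init bulk_insert write. */
  public static Properties deferredBulkInsertOptions() {
    Properties props = new Properties();
    // Assumption: the defer flag is a boolean passed as a string.
    props.setProperty(DEFER_RLI_INIT_KEY, "true");
    props.setProperty(WRITE_OPERATION_KEY, "bulk_insert");
    return props;
  }
}
```

Issuing two consecutive writes with these options against a fresh table corresponds to the commit #1 / commit #2 sequence described above.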
   
   ### Summary and Changelog
   
   **Core fix (`HoodieBackedTableMetadataWriter.java`)**
   - Plumb a resolved `HoodieSchema` argument through the RLI init chain: 
`initializeFilegroupsAndCommitToRecordIndexPartition` → 
`initializeFilegroupsAndCommitToPartitionedRecordIndexPartition` → 
`initializeRecordIndexPartition` → `readRecordKeysFromFileSliceSnapshot`. 
Replaces the inline `HoodieSchema.parse(dataWriteConfig.getWriteSchema())` 
previously evaluated inside the executor closure.
   - New `resolveRecordIndexInitSchema(...)` helper: prefer 
`dataWriteConfig.getWriteSchema()`; on empty, fall back to 
`HoodieTableMetadataUtil.tryResolveSchemaForTable(dataMetaClient)` (the latest 
committed schema). Throws a clear `HoodieMetadataException` when neither is 
resolvable.
   - Renamed the local `Lazy<Option<HoodieSchema>> tableSchema` → 
`tableSchemaLazy` at the call site for clarity; javadoc on 
`readRecordKeysFromFileSliceSnapshot` updated.
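
The fallback order described for `resolveRecordIndexInitSchema(...)` can be sketched roughly as below. This is not the code from the PR. It is a self-contained model in which schemas are plain strings, the latest committed schema is supplied by a caller-provided `Supplier` (standing in for `HoodieTableMetadataUtil.tryResolveSchemaForTable(dataMetaClient)`), and `SchemaResolutionException` is a hypothetical stand-in for `HoodieMetadataException`.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class RecordIndexSchemaResolution {

  /** Hypothetical stand-in for HoodieMetadataException. */
  public static class SchemaResolutionException extends RuntimeException {
    public SchemaResolutionException(String message) {
      super(message);
    }
  }

  /**
   * Prefers the schema carried by the write config. If it is null or blank,
   * falls back to the latest committed table schema. Throws if neither exists.
   */
  public static String resolveInitSchema(
      String writeSchema, Supplier<Optional<String>> latestCommittedSchema) {
    if (writeSchema != null && !writeSchema.trim().isEmpty()) {
      // The supplier is not consulted when the write config has a schema.
      return writeSchema;
    }
    return latestCommittedSchema.get().orElseThrow(
        () -> new SchemaResolutionException(
            "Cannot resolve a schema for record index initialization: "
                + "write config schema is empty and no committed schema was found"));
  }
}
```

Resolving the schema once and passing the result down the init chain, rather than parsing it inside the executor closure, means a missing schema surfaces as a single clear error before any file slices are read.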
   
   **Tests (`TestRecordLevelIndex.scala`)**
   - Extended `testRecordLevelIndex` with a `deferRLIInit` parameter. When set, 
the test asserts that after the first save the RLI partition is NOT yet present 
in the metadata table config; it then proceeds through the existing assertion 
flow which builds the metadata writer (triggering deferred init on the 2nd 
entry).
   - Added `testPartitionedRecordLevelIndexDefer(streamingWriteEnabled)` which 
drives the deferred path via the existing helper and then verifies compaction.
   - Added 
`testPartitionedRecordLevelIndexDeferWithBulkInsert(streamingWriteEnabled)`: 
commit #1 and commit #2 are both `bulk_insert` against a fresh table with defer 
enabled. Validates:
     - After commit #1 the RLI metadata partition is not initialized.
     - After commit #2 the deferred RLI bootstrap completes (partition present, 
partitioned RLI type).
     - Record-key → location mapping is correct across all data partitions for 
both batches, including cross-partition negative lookups.
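
As an illustration of what a "cross-partition negative lookup" checks, the following toy model treats a partitioned record index as a map from data partition to a record-key → file-id map. It is a sketch only: `PartitionedRecordIndexModel` and its methods are hypothetical and do not mirror Hudi's metadata table APIs or the Scala test code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class PartitionedRecordIndexModel {
  // data partition path -> (record key -> file id holding that key)
  private final Map<String, Map<String, String>> index = new HashMap<>();

  /** Registers the location of a record key within one data partition. */
  public void put(String partition, String recordKey, String fileId) {
    index.computeIfAbsent(partition, p -> new HashMap<>()).put(recordKey, fileId);
  }

  /** Looks a key up in exactly one partition; other partitions are not searched. */
  public Optional<String> lookup(String partition, String recordKey) {
    Map<String, String> keysInPartition =
        index.getOrDefault(partition, Collections.emptyMap());
    return Optional.ofNullable(keysInPartition.get(recordKey));
  }
}
```

In this model a key written to one partition resolves there and is absent when looked up under any other partition, which is the property the last validation bullet describes.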
   
   ### Impact
   
   User-facing changes: none beyond what was introduced in #18353. This is a 
follow-up bug fix that makes the opt-in deferred RLI init flow actually usable 
on the 2nd commit (including `bulk_insert`).
   
   Performance impact: none.
   
   ### Risk Level
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

