[I] perf: Row-writer bulk insert re-instantiates the key generator and boxes the full row per record when dropping partition columns [hudi]

via GitHub Wed, 10 Jun 2026 08:46:41 -0700


voonhous opened a new issue, #18969:
URL: https://github.com/apache/hudi/issues/18969


   ### Describe the problem
   
   `BulkInsertDataInternalWriterHelper#write(InternalRow)` redoes constant work 
for every row when `hoodie.datasource.write.drop.partition.columns=true`:
   
   - `writeConfig.shouldDropPartitionColumns()` resolves the config per row
   - `HoodieDatasetBulkInsertHelper.getPartitionPathCols(writeConfig)` 
instantiates a brand new key generator via constructor reflection per row 
(`ReflectionUtils` caches only the Class object, not instances; key generator 
construction parses TypedProperties and splits field configs)
   - the partition-column ordinals and a fresh HashSet are recomputed per row
   - the whole row is converted via `row.toSeq(structType)` (boxing every 
column), copied to a list, filtered, and rebuilt with `InternalRow.fromSeq` -- 
a full serde-style round trip per record in what is designed as the 
zero-conversion fast path
   
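   None of the config lookup, key generator construction, or ordinal 
computation depends on the row, and the per-row conversion can be replaced by a 
direct field copy. The path runs per record in row-writer bulk insert 
(`HoodieDatasetBulkInsertHelper.bulkInsert`) and in clustering rewrites 
whenever drop-partition-columns is enabled.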
   
   ### Proposed fix
   
   Hoist the constant work to the constructor, guarded by the flag so tables 
without a key generator class are unaffected: resolve 
`shouldDropPartitionColumns` once, compute the partition-column ordinals once, 
and derive the retained ordinals and types for all non-partition fields. In 
`write()`, when enabled, copy the retained fields into a fresh 
`GenericInternalRow` with a plain `row.get(ordinal, type)` loop (per-row 
allocation kept, so aliasing behavior is unchanged); otherwise pass the row 
through untouched. Output rows are value-identical.
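
   The hoisting pattern described above can be sketched roughly as follows. 
This is a minimal, self-contained illustration, not the Hudi code: 
`PartitionColumnDropper`, its constructor arguments, and `project` are 
hypothetical names, and rows are modeled as plain `Object[]` arrays instead of 
Spark's `InternalRow`/`GenericInternalRow`, so no `row.get(ordinal, type)` call 
appears.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical stand-in for the writer helper. Rows are modeled as Object[]
 * (an assumption for illustration), not as Spark InternalRow instances.
 */
final class PartitionColumnDropper {
    private final boolean dropPartitionColumns;
    private final int[] retainedOrdinals;

    PartitionColumnDropper(String[] fieldNames,
                           List<String> partitionCols,
                           boolean dropPartitionColumns) {
        // Resolve the flag once instead of once per row.
        this.dropPartitionColumns = dropPartitionColumns;
        if (dropPartitionColumns) {
            // Compute the ordinals of the non-partition fields once.
            Set<String> partitionSet = new HashSet<>(partitionCols);
            int[] tmp = new int[fieldNames.length];
            int n = 0;
            for (int i = 0; i < fieldNames.length; i++) {
                if (!partitionSet.contains(fieldNames[i])) {
                    tmp[n++] = i;
                }
            }
            this.retainedOrdinals = Arrays.copyOf(tmp, n);
        } else {
            // Guarded by the flag: nothing is computed when the feature is off.
            this.retainedOrdinals = new int[0];
        }
    }

    /** Per-row work: a plain copy loop over the precomputed ordinals. */
    Object[] project(Object[] row) {
        if (!dropPartitionColumns) {
            return row; // pass the row through untouched
        }
        // A fresh array per row, so callers never share a mutable buffer.
        Object[] out = new Object[retainedOrdinals.length];
        for (int i = 0; i < retainedOrdinals.length; i++) {
            out[i] = row[retainedOrdinals[i]];
        }
        return out;
    }
}
```

   With fields `id`, `region`, `value` and `region` as the partition column, 
`project` would return a new two-element row holding `id` and `value`; with the 
flag off it returns the input row itself.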
   
   Will raise a PR for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
