voonhous opened a new issue, #18969: URL: https://github.com/apache/hudi/issues/18969
### Describe the problem

`BulkInsertDataInternalWriterHelper#write(InternalRow)` redoes constant work for every row when `hoodie.datasource.write.drop.partition.columns=true`:

- `writeConfig.shouldDropPartitionColumns()` resolves the config per row
- `HoodieDatasetBulkInsertHelper.getPartitionPathCols(writeConfig)` instantiates a brand new key generator via constructor reflection per row (`ReflectionUtils` caches only the Class object, not instances; key generator construction parses TypedProperties and splits field configs)
- the partition-column ordinals and a fresh HashSet are recomputed per row
- the whole row is converted via `row.toSeq(structType)` (boxing every column), copied to a list, filtered, and rebuilt with `InternalRow.fromSeq`. That is a full serde-style round trip per record in what is designed as the zero-conversion fast path

None of this depends on the row. The path runs per record in row-writer bulk insert (`HoodieDatasetBulkInsertHelper.bulkInsert`) and clustering rewrites whenever drop-partition-columns is enabled.

### Proposed fix

Hoist the constant work to the constructor, guarded by the flag so tables without a key generator class are unaffected: resolve `shouldDropPartitionColumns` once, compute the partition-column ordinals once, and derive the retained ordinals and types for all non-partition fields.

In `write()`, when enabled, copy the retained fields into a fresh `GenericInternalRow` with a plain `row.get(ordinal, type)` loop (per-row allocation kept, so aliasing behavior is unchanged); otherwise pass the row through untouched. Output rows are value-identical.

Will raise a PR for this.
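
For illustration only, the hoisting pattern described above might look roughly like the following. This is a minimal sketch, not the actual patch: it models rows as plain `Object[]` so it runs without Spark or Hudi, and the names `PartitionColumnDropper` and `project` are hypothetical. The real change would presumably use `row.get(ordinal, type)` and `GenericInternalRow` as described in the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the writer helper. Rows are modeled as Object[]
// rather than Spark's InternalRow (an assumption made to keep this runnable).
final class PartitionColumnDropper {
    private final boolean dropPartitionColumns;
    private final int[] retainedOrdinals;

    PartitionColumnDropper(String[] fieldNames, Set<String> partitionCols, boolean dropPartitionColumns) {
        // Resolve the flag once instead of once per row.
        this.dropPartitionColumns = dropPartitionColumns;
        if (dropPartitionColumns) {
            // Compute the ordinals of the non-partition fields once.
            List<Integer> retained = new ArrayList<>();
            for (int i = 0; i < fieldNames.length; i++) {
                if (!partitionCols.contains(fieldNames[i])) {
                    retained.add(i);
                }
            }
            this.retainedOrdinals = retained.stream().mapToInt(Integer::intValue).toArray();
        } else {
            // Guarded by the flag: nothing is computed when the feature is off.
            this.retainedOrdinals = new int[0];
        }
    }

    // Per-row work: a plain copy loop over the precomputed ordinals.
    Object[] project(Object[] row) {
        if (!dropPartitionColumns) {
            return row; // pass the row through untouched
        }
        // A fresh output array per row, so the output never aliases the input.
        Object[] out = new Object[retainedOrdinals.length];
        for (int i = 0; i < retainedOrdinals.length; i++) {
            out[i] = row[retainedOrdinals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> partitionCols = new HashSet<>(Arrays.asList("region"));
        PartitionColumnDropper dropper =
            new PartitionColumnDropper(new String[] {"id", "region", "value"}, partitionCols, true);
        Object[] projected = dropper.project(new Object[] {1, "us", 2.5});
        System.out.println(Arrays.toString(projected));
    }
}
```

The only per-row cost left in this sketch is one array allocation and one indexed copy per retained field; the field-name lookup and set membership checks happen once in the constructor.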
