voonhous opened a new issue, #18999:
URL: https://github.com/apache/hudi/issues/18999

   ### Describe the problem
   
   `HoodieTableMetadataUtil#collectColumnRangeMetadata` iterates the target 
columns inside the per-record loop and, for every record and every target 
field, recomputes values that depend only on the (fixed) target field list:
   
   - `field.schema().getNonNullType()` rebuilds the union-member wrappers for 
nullable fields, allocating a fresh `HoodieSchema` per call (via `getTypes()`).
   - because that non-null schema is a fresh instance on every record, its 
`toAvroSchema()` (used for the min/max comparisons) never memoizes and 
re-derives the Avro schema per record.
   
   This recomputation happens for every record and every indexed column on the 
column-stats write path, where the method is invoked for each appended log 
block in `HoodieAppendHandle` and for each log file in 
`getLogFileColumnRangeMetadata`. For N records and F indexed columns this is 
O(N*F) redundant schema-wrapper rebuilds and Avro conversions, plus the 
transient allocations they create.
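
   To illustrate why per-instance memoization does not help here, the sketch below uses a hypothetical `SchemaWrapper` class (not the real `HoodieSchema` API). Its `nonNull()` and `derived()` methods are stand-ins for `getNonNullType()` and `toAvroSchema()` under the behaviour described above: the former returns a fresh instance on each call, and the latter caches only on the instance it is called on.

```java
import java.util.concurrent.atomic.AtomicInteger;

class MemoizationSketch {
    // Counts how many times the (stand-in) expensive derivation actually runs.
    static final AtomicInteger DERIVATIONS = new AtomicInteger();

    // Hypothetical stand-in for a schema wrapper; not the Hudi class.
    static final class SchemaWrapper {
        private final String name;
        private String derived; // lazily computed, cached on this instance only

        SchemaWrapper(String name) {
            this.name = name;
        }

        // Stand-in for getNonNullType(): assumed to return a fresh wrapper per call.
        SchemaWrapper nonNull() {
            return new SchemaWrapper(name);
        }

        // Stand-in for toAvroSchema(): memoized, but only per instance.
        String derived() {
            if (derived == null) {
                DERIVATIONS.incrementAndGet();
                derived = "avro:" + name;
            }
            return derived;
        }
    }

    // Resolves the non-null wrapper inside the loop: every cache starts empty.
    static int derivationsWithFreshInstances(int records) {
        DERIVATIONS.set(0);
        SchemaWrapper field = new SchemaWrapper("price");
        for (int i = 0; i < records; i++) {
            field.nonNull().derived();
        }
        return DERIVATIONS.get();
    }

    // Resolves the non-null wrapper once: the instance cache is reused.
    static int derivationsWithStableInstance(int records) {
        DERIVATIONS.set(0);
        SchemaWrapper nonNull = new SchemaWrapper("price").nonNull();
        for (int i = 0; i < records; i++) {
            nonNull.derived();
        }
        return DERIVATIONS.get();
    }

    public static void main(String[] args) {
        System.out.println("fresh:  " + derivationsWithFreshInstances(1000));
        System.out.println("stable: " + derivationsWithStableInstance(1000));
    }
}
```

   In this toy model the fresh-instance loop derives once per record, while the stable-instance loop derives once in total.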
   
   ### Proposed fix
   
   Resolve the non-null `HoodieSchema` once per target field before the record 
loop and iterate that precomputed list. Holding a stable `HoodieSchema` 
instance also lets `toAvroSchema()` memoize across records. The per-record work 
(type-support check, value extraction, min / max / null / value counts) is 
unchanged, so results are identical -- `getNonNullType()` and `toAvroSchema()` 
are pure functions of the fixed field schemas. Covered by the existing 
column-stats tests (`TestColumnStatsIndex`, 
`TestColStatsRecordWithMetadataRecord`, `TestDataSkippingWithMORColstats`).
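
   The shape of the change can be sketched roughly as follows. This is a simplified illustration, not the actual patch: `ResolvedField`, `resolve`, and `collectRanges` are hypothetical names, records are modelled as plain maps of `Long` values, and the statistics layout (min, max, null count, value count) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HoistedRangeSketch {
    // Hypothetical pairing of a target field with its pre-resolved non-null type.
    static final class ResolvedField {
        final String name;
        final String nonNullType;

        ResolvedField(String name, String nonNullType) {
            this.name = name;
            this.nonNullType = nonNullType;
        }
    }

    // Counts resolutions so the example can show they scale with F, not N*F.
    static int resolveCalls = 0;

    // Stand-in for work that depends only on the fixed target field list.
    static ResolvedField resolve(String fieldName) {
        resolveCalls++;
        return new ResolvedField(fieldName, "long");
    }

    // Returns field name -> {min, max, nullCount, valueCount}.
    // valueCount here counts every record seen, nulls included (an assumption).
    static Map<String, long[]> collectRanges(List<Map<String, Long>> records,
                                             List<String> targetFields) {
        // Resolve once per target field, before the record loop.
        List<ResolvedField> resolved = new ArrayList<>();
        for (String fieldName : targetFields) {
            resolved.add(resolve(fieldName));
        }

        Map<String, long[]> stats = new LinkedHashMap<>();
        for (ResolvedField field : resolved) {
            stats.put(field.name, new long[] {Long.MAX_VALUE, Long.MIN_VALUE, 0, 0});
        }

        // The per-record work is unchanged; it only reads the precomputed list.
        for (Map<String, Long> record : records) {
            for (ResolvedField field : resolved) {
                long[] s = stats.get(field.name);
                Long value = record.get(field.name);
                s[3]++;
                if (value == null) {
                    s[2]++;
                    continue;
                }
                s[0] = Math.min(s[0], value);
                s[1] = Math.max(s[1], value);
            }
        }
        return stats;
    }
}
```

   With the resolution hoisted, the number of `resolve` calls equals the number of target fields regardless of how many records are processed, while the collected statistics are computed exactly as before.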
   
   Will raise a PR for this.
   

