voonhous opened a new issue, #18996:
URL: https://github.com/apache/hudi/issues/18996

   ### Describe the problem
   
   There are two avoidable per-record costs on the metadata-table read path, both in 
`MetadataPartitionType` / `HoodieMetadataPayload`:
   
   1. `MetadataPartitionType.get(int)` iterates `values()`, which clones the 
enum constant array on every call. It runs once per record materialized from 
the metadata table: the `HoodieMetadataPayload(Option<GenericRecord>)` 
constructor calls 
`MetadataPartitionType.get(type).constructMetadataPayload(...)` for every 
record returned by RLI / secondary-index / column-stats lookups and full scans, 
and `preCombine` repeats it for every key merged from MDT log files.
   
   2. `RECORD_INDEX.constructMetadataPayload` decodes the numeric record-index 
fields with `Long.parseLong(record.get(field).toString())` / 
`Integer.parseInt(...)`, even though the Avro generic record already holds 
boxed `Long` / `Integer` values matching the `long` / `int` field types in 
`HoodieMetadata.avsc`. That is five `String` allocations plus five parses per 
materialized RLI record.
   
   For upsert tagging that reads millions of RLI entries, this is avoidable 
per-record garbage and CPU overhead on the index-lookup hot path.
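
The cloning behavior behind point 1 can be checked in isolation. The following is a minimal sketch, not Hudi code: `PartitionKind` is a hypothetical stand-in enum, and the helper names are made up for illustration. It shows that each `values()` call hands back a distinct array, which is what makes calling it per record wasteful.

```java
public class ValuesCloneDemo {
    // Hypothetical stand-in enum; this is not Hudi's MetadataPartitionType.
    public enum PartitionKind { FILES, COLUMN_STATS, RECORD_INDEX }

    // True if two consecutive values() calls return distinct array objects.
    public static boolean valuesReturnsFreshArray() {
        PartitionKind[] first = PartitionKind.values();
        PartitionKind[] second = PartitionKind.values();
        return first != second;
    }

    // Mutating a returned array does not leak into later calls,
    // which is the reason each call gets its own copy.
    public static boolean mutationIsIsolated() {
        PartitionKind[] copy = PartitionKind.values();
        copy[0] = null;
        return PartitionKind.values()[0] == PartitionKind.FILES;
    }

    public static void main(String[] args) {
        System.out.println("fresh array per call: " + valuesReturnsFreshArray());
        System.out.println("mutation isolated:    " + mutationIsIsolated());
    }
}
```

The JIT may eliminate some of these copies through escape analysis, but that is not guaranteed on a hot path like this one.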
   
   ### Proposed fix
   
   1. Cache `values()` once in a `private static final MetadataPartitionType[]` 
and iterate that in `get(int)`. The linear scan and the 
`IllegalArgumentException` for unknown types are unchanged. A direct-index 
lookup table is avoided because `EXPRESSION_INDEX` has record type `-1`.
   
   2. Read the numeric record-index fields directly via `((Number) 
record.get(field)).longValue()` / `.intValue()` instead of `toString` + parse. 
The avsc declares these fields `long` / `int`, so the values are always `Long` 
/ `Integer`, and RLI records always populate them (UUID encoding sets the bits, 
raw encoding sets `-1` sentinels). The string fields (`partition`, `fileId`) 
keep `.toString()`. Reconstructed records are identical.
   
   Both changes are behavior-preserving; verified with an Avro write/read 
round-trip over both fileId encodings.
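
A rough sketch of the shape of both changes, under stated assumptions: `PartitionKind` stands in for `MetadataPartitionType`, a plain `Map<String, Object>` stands in for the Avro `GenericRecord`, and the field names and every record type except `-1` for `EXPRESSION_INDEX` are illustrative rather than taken from the Hudi source. It is not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataReadSketch {
    // Hypothetical stand-in for MetadataPartitionType. Only the -1 record type
    // for EXPRESSION_INDEX comes from the issue; the other values are illustrative.
    public enum PartitionKind {
        FILES(1), RECORD_INDEX(5), EXPRESSION_INDEX(-1);

        // values() runs once at class initialization; get(int) reuses this array.
        private static final PartitionKind[] CACHED_VALUES = values();

        private final int recordType;

        PartitionKind(int recordType) {
            this.recordType = recordType;
        }

        // Same linear scan and same exception as before, minus the per-call clone.
        public static PartitionKind get(int type) {
            for (PartitionKind kind : CACHED_VALUES) {
                if (kind.recordType == type) {
                    return kind;
                }
            }
            throw new IllegalArgumentException("No partition type for record type: " + type);
        }
    }

    // Old style: allocates a String and parses it back into a long.
    public static long readLongViaString(Map<String, Object> record, String field) {
        return Long.parseLong(record.get(field).toString());
    }

    // New style: the value is already a boxed number, so unbox it directly.
    public static long readLong(Map<String, Object> record, String field) {
        return ((Number) record.get(field)).longValue();
    }

    public static int readInt(Map<String, Object> record, String field) {
        return ((Number) record.get(field)).intValue();
    }

    // Hypothetical record with made-up field names, including -1 sentinels.
    public static Map<String, Object> sampleRecord() {
        Map<String, Object> record = new HashMap<>();
        record.put("highBits", -1L);
        record.put("lowBits", 1234567890123L);
        record.put("fileIndex", -1);
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record = sampleRecord();
        System.out.println(PartitionKind.get(-1));
        System.out.println(readLong(record, "lowBits") == readLongViaString(record, "lowBits"));
        System.out.println(readInt(record, "fileIndex"));
    }
}
```

The `Number` cast assumes the field is never null, which matches the claim above that RLI records always populate these fields. A nullable field would need an explicit check before unboxing.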
   
   Will raise a PR for this.
   