wombatu-kun opened a new pull request, #16735:
URL: https://github.com/apache/iceberg/pull/16735

   ## Problem
   
   `HiveSchemaUtil.convert(Schema)` builds the Hive column list on every Hive 
table and view commit (via `HiveOperationsBase.storageDescriptor`). For nested 
types it recursed through `convert(Type)`, which is 
`TypeInfoUtils.getTypeInfoFromTypeString(convertToTypeString(type))`: it built 
the type substring, parsed it back into a Hive `TypeInfo` tree, then 
stringified that tree straight back to the same substring. That round trip 
produces no value, and because the recursion went through the parser at every 
nesting level the wasted work and the throwaway `TypeInfo` allocations 
compounded with schema depth.
   
   It was also a latent crash. An Iceberg `VARIANT` maps to the placeholder 
string `"unknown"`, which is not a Hive type, so 
`TypeInfoUtils.getTypeInfoFromTypeString("unknown")` throws `RuntimeException: 
Internal error parsing position 0 of 'unknown'`. A `VARIANT` nested inside a 
`struct`/`list`/`map` therefore failed the commit outright. The top-level case 
never hit this because `convert(Schema)` calls `convertToTypeString` directly.
   
   ## Change
   
   Rewrite the private `convertToTypeString` to recurse on itself via a 
`StringBuilder`-based `appendTypeString(StringBuilder, Type)`, building the 
Hive type string directly with no `TypeInfoUtils` parsing and no 
`String.format`. The public methods and their outputs are unchanged; only the 
internal string builder is restructured. Output strings are identical for every 
type the previous code handled, which is pinned byte for byte by the existing 
`testComplexSchemaConvertToHiveSchema`. Nested `VARIANT` now emits `"unknown"` 
directly, consistent with the already-tested top-level mapping.
   
   ## Tests
   
   - Added `testNestedVariantTypeConvertToHiveSchema`: it throws on the 
previous code (reproducing the failure) and is green after the change.
   - Existing `TestHiveSchemaUtil` coverage (nested struct/list/map strings, 
every primitive type, reverse conversion) still passes, confirming output is 
unchanged for all other types.
   
   ## Performance
   
   Added `HiveSchemaConversionBenchmark` (JMH). The conversion runs once per 
commit; the benchmark isolates `HiveSchemaUtil.convert(Schema)`. JDK 17, 
average time, `-prof gc` (speedup is before/after):
   
   | shape | time before | time after | time speedup | alloc before | alloc 
after | alloc reduction |
   |---|---|---|---|---|---|---|
   | flat (30 primitives) | 2.74 us/op | 2.12 us/op | 1.29x | 6,080 B/op | 
5,872 B/op | 1.04x |
   | nested (struct / list / map columns) | 45.66 us/op | 4.53 us/op | 10.1x | 
118,632 B/op | 7,400 B/op | 16.0x |
   | deeply nested (10 struct levels) | 40.87 us/op | 0.92 us/op | 44.3x | 
122,576 B/op | 2,184 B/op | 56.1x |
   
   Flat schemas are essentially unchanged (primitive cases never used the 
parsing path); the gain is concentrated on nested schemas, where the throwaway 
`TypeInfo` trees dominated, and grows with nesting depth (the old code 
re-parsed each subtree at every level).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to