wombatu-kun opened a new pull request, #16735:
URL: https://github.com/apache/iceberg/pull/16735
## Problem
`HiveSchemaUtil.convert(Schema)` builds the Hive column list on every Hive
table and view commit (via `HiveOperationsBase.storageDescriptor`). For nested
types it recursed through `convert(Type)`, which is
`TypeInfoUtils.getTypeInfoFromTypeString(convertToTypeString(type))`: it built
the type substring, parsed it back into a Hive `TypeInfo` tree, then
stringified that tree straight back to the same substring. That round trip
produces no value, and because the recursion went through the parser at every
nesting level the wasted work and the throwaway `TypeInfo` allocations
compounded with schema depth.
It was also a latent crash. An Iceberg `VARIANT` maps to the placeholder
string `"unknown"`, which is not a Hive type, so
`TypeInfoUtils.getTypeInfoFromTypeString("unknown")` throws `RuntimeException:
Internal error parsing position 0 of 'unknown'`. A `VARIANT` nested inside a
`struct`/`list`/`map` therefore failed the commit outright. The top-level case
never hit this because `convert(Schema)` calls `convertToTypeString` directly.
## Change
Rewrite the private `convertToTypeString` to recurse on itself via a
`StringBuilder`-based `appendTypeString(StringBuilder, Type)`, building the
Hive type string directly with no `TypeInfoUtils` parsing and no
`String.format`. The public methods and their outputs are unchanged; only the
internal string builder is restructured. Output strings are identical for every
type the previous code handled, which is pinned byte for byte by the existing
`testComplexSchemaConvertToHiveSchema`. Nested `VARIANT` now emits `"unknown"`
directly, consistent with the already-tested top-level mapping.
## Tests
- Added `testNestedVariantTypeConvertToHiveSchema`: it throws on the
previous code (reproducing the failure) and is green after the change.
- Existing `TestHiveSchemaUtil` coverage (nested struct/list/map strings,
every primitive type, reverse conversion) still passes, confirming output is
unchanged for all other types.
## Performance
Added `HiveSchemaConversionBenchmark` (JMH). The conversion runs once per
commit; the benchmark isolates `HiveSchemaUtil.convert(Schema)`. JDK 17,
average time, `-prof gc` (speedup is before/after):
| shape | time before | time after | time speedup | alloc before | alloc
after | alloc reduction |
|---|---|---|---|---|---|---|
| flat (30 primitives) | 2.74 us/op | 2.12 us/op | 1.29x | 6,080 B/op |
5,872 B/op | 1.04x |
| nested (struct / list / map columns) | 45.66 us/op | 4.53 us/op | 10.1x |
118,632 B/op | 7,400 B/op | 16.0x |
| deeply nested (10 struct levels) | 40.87 us/op | 0.92 us/op | 44.3x |
122,576 B/op | 2,184 B/op | 56.1x |
Flat schemas are essentially unchanged (primitive cases never used the
parsing path); the gain is concentrated on nested schemas, where the throwaway
`TypeInfo` trees dominated, and grows with nesting depth (the old code
re-parsed each subtree at every level).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]