yaooqinn opened a new pull request, #12236:
URL: https://github.com/apache/gluten/pull/12236
### Summary
Add an LRU intern cache for the `StructType` <-> JSON wire form used by
`ColumnarCachedBatchSerializer`, and wire it into the hot write/read
paths. The wire format is unchanged; the cache memoizes pure functions
(`StructType.json` / `DataType.fromJson`) that a single Spark query
typically calls thousands of times against a handful of distinct
schemas.
### What
Three commits, in logical order:
1. **Add SchemaJsonInternCache** -- `private[execution]` class, two
Caffeine LRU caches (cap=256 each), encode side
`StructType -> Array[Byte]`, decode side `String -> StructType`.
Thread-safety delegated to Caffeine `get(key, mappingFunction)`.
Six tests pin determinism / capacity / concurrency invariants.
2. **Wire into ColumnarCachedBatchSerializer** -- Replace the two
ad-hoc codec calls (write side and read side) with cache lookups.
Cache lives on the serializer companion object so it survives
Kryo's per-stream serializer churn.
3. **Extend microbench** -- Three sections appended to
`ColumnarTableCachePartitionStatsBenchmark`:
- encode round-trip over synthetic + TPC-DS schemas
- decode round-trip over the same fixture set
- working-set sweep at cap, 2xcap, 4xcap
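
The memoization idea behind commit 1 can be sketched as follows. This is a minimal illustration, not the PR's `SchemaJsonInternCache`: it swaps Caffeine for a synchronized access-ordered `LinkedHashMap` so that it runs on the JDK alone, and the class name `InternCacheSketch` and its methods are hypothetical. It shows a bounded cache in front of a pure function, where a miss simply re-evaluates the function and a thrown exception is never stored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a bounded intern cache; not the PR's actual class. */
final class InternCacheSketch<K, V> {
    private final Function<K, V> pureFn;
    private final Map<K, V> lru;
    private int misses = 0;

    InternCacheSketch(final int capacity, Function<K, V> pureFn) {
        this.pureFn = pureFn;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the cap is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the interned value, computing it on a miss (pureFn must return non-null). */
    synchronized V get(K key) {
        V value = lru.get(key);
        if (value == null) {
            // Miss path: call the pure function directly. If it throws,
            // the exception propagates and nothing is cached.
            value = pureFn.apply(key);
            misses++;
            lru.put(key, value);
        }
        return value;
    }

    synchronized int misses() { return misses; }

    synchronized int entries() { return lru.size(); }
}
```

In the PR the equivalent of `pureFn` would be the schema codec (`StructType.json` on the encode side, `DataType.fromJson` on the decode side), and holding one such cache per direction on a long-lived object is what lets repeated calls against the same few schemas return the already-computed value.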
### Why
The schema codec is a pure-function hot path. The benchmark shows that
with the cache on, the encode leg stays at ~6 ms regardless of working
set (cap=256 is sufficient for the fixture set), while with the cache
off it ranges from seconds at small widths to minutes at 1000-field
schemas. A realistic TPC-DS schema (store_sales, 23 columns) sees a
~12x decode speedup.
### Bench numbers
Verbatim from `benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt`:
```
decode tpcds-store_sales-23col:
  off (raw DataType.fromJson per call)    2207 ms    1.0X
  on  (intern.decodeStructType)            185 ms   11.9X
C1 hit (256 schemas == cap):
  off                                      102 ms    1.0X
  on                                         4 ms   24.3X
C3 churn (1024 schemas == 4x cap):
  off                                      407 ms    1.0X
  on                                        17 ms   24.0X
```
Pre-existing partition-stats sections rerun in this commit show no
regression (build 1.0x / 0.9x, high-sel 8.4x, low-sel 1.5x, point
12.0x -- within run-to-run variance of the committed baseline).
### Risk
- Wire format **unchanged** -- read/write still emit length-prefixed
UTF-8 JSON bytes. Existing on-disk caches remain readable.
- Cache miss path is identical to current behavior (pure functions
re-evaluate; no exception caching).
- Heap retention bounded at 256 entries x schema-JSON size; ~MB
worst case on pathological 1000-field schemas, KB-MB on realistic
workloads.
- No new SQLConf, no logger, no metric.
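
To illustrate why the first bullet holds, here is a small sketch of length-prefixed framing. The 4-byte big-endian length prefix and the names `SchemaFrameSketch`, `writeFrame`, and `readFrame` are assumptions made for illustration; the PR text only states that the frame is length-prefixed UTF-8 JSON. The frame depends only on the payload bytes, so it does not matter whether those bytes came from a cache hit or a fresh encode.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical framing helper; the prefix width is an assumption, not taken from the PR. */
final class SchemaFrameSketch {
    /** Writes [4-byte big-endian length][UTF-8 JSON bytes]. */
    static byte[] writeFrame(byte[] jsonUtf8) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + jsonUtf8.length);
        buf.putInt(jsonUtf8.length); // ByteBuffer is big-endian by default
        buf.put(jsonUtf8);
        return buf.array();
    }

    /** Reads the length prefix, then decodes exactly that many bytes as UTF-8. */
    static String readFrame(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int len = buf.getInt();
        byte[] payload = new byte[len];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

Because a cached byte array and a freshly encoded one are byte-for-byte equal for the same schema, the frames they produce are identical, which is the property that keeps previously written data readable.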
### Test
- New: `SchemaJsonInternCacheSuite` (6 tests, determinism /
capacity / concurrency)
- Existing: `ColumnarCachedBatchSerializerHelperSuite` (4 tests)
passes against wired code, confirming Kryo round-trip + frame
parsing + fall-back paths intact.