[PR] [VL] Intern StructType <-> JSON codec on the ColumnarCachedBatchSerializer hot path [gluten]

via GitHub Thu, 04 Jun 2026 02:43:13 -0700


yaooqinn opened a new pull request, #12236:
URL: https://github.com/apache/gluten/pull/12236


   ### Summary
   
   Add an LRU intern cache for the `StructType` <-> JSON wire form used by
   `ColumnarCachedBatchSerializer`, and wire it into the hot write/read
   paths. The wire format is unchanged; the cache memoizes pure functions
   (`StructType.json` / `DataType.fromJson`) that a single Spark query
   typically calls thousands of times against a handful of distinct
   schemas.
   
   ### What
   
   Three commits, in logical order:
   
   1. **Add SchemaJsonInternCache** -- `private[execution]` class, two
      Caffeine LRU caches (cap=256 each), encode side
      `StructType -> Array[Byte]`, decode side `String -> StructType`.
      Thread-safety delegated to Caffeine `get(key, mappingFunction)`.
      Six tests pin determinism / capacity / concurrency invariants.
   
   2. **Wire into ColumnarCachedBatchSerializer** -- Replace the two
      ad-hoc codec calls (write side and read side) with cache lookups.
      Cache lives on the serializer companion object so it survives
      Kryo's per-stream serializer churn.
   
   3. **Extend microbench** -- Three sections appended to
      `ColumnarTableCachePartitionStatsBenchmark`:
      - encode round-trip over synthetic + TPC-DS schemas
      - decode round-trip over the same fixture set
      - working-set sweep at cap, 2xcap, 4xcap
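
   The memoization idea behind commit 1 can be sketched as follows. This
   is a minimal illustration, not the PR's code: `InternCacheSketch` and
   `LruMemo` are hypothetical names, a JDK access-ordered `LinkedHashMap`
   under a lock stands in for Caffeine, and a plain `String` stands in
   for `StructType`.

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.function.Function;

   /** Hypothetical stand-in for an intern cache; names are illustrative only. */
   public class InternCacheSketch {

     /** Bounded LRU memoizer for a pure function (assumption: JDK map, not Caffeine). */
     static final class LruMemo<K, V> {
       private final Map<K, V> map;
       private final Function<K, V> pureFn;

       LruMemo(int capacity, Function<K, V> pureFn) {
         this.pureFn = pureFn;
         // accessOrder=true keeps the least-recently-used entry first.
         this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
           @Override
           protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
             return size() > capacity;
           }
         };
       }

       /** Returns the cached value, computing it on a miss. A thrown exception caches nothing. */
       synchronized V get(K key) {
         V value = map.get(key);
         if (value == null) {
           value = pureFn.apply(key);
           map.put(key, value);
         }
         return value;
       }

       synchronized int size() {
         return map.size();
       }
     }

     public static void main(String[] args) {
       int[] encodeCalls = {0};
       // Encode side: "schema" -> UTF-8 JSON bytes. The JSON shape here is made up.
       LruMemo<String, byte[]> encode = new LruMemo<>(2, schema -> {
         encodeCalls[0]++;
         return ("{\"type\":\"struct\",\"name\":\"" + schema + "\"}")
             .getBytes(StandardCharsets.UTF_8);
       });

       byte[] first = encode.get("a");   // miss
       byte[] second = encode.get("a");  // hit, same array instance
       encode.get("b");                  // miss
       encode.get("c");                  // miss, evicts "a" (capacity 2)
       encode.get("a");                  // miss again after eviction

       System.out.println("same instance on hit: " + (first == second));
       System.out.println("encode calls: " + encodeCalls[0]);
     }
   }
   ```

   One consequence of interning byte arrays this way is that every hit
   returns the same array instance, so callers have to treat it as
   read-only.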
   
   ### Why
   
   The schema codec is a pure-function hot path. In the bench, the encode
   leg with the cache on levels off at ~6 ms regardless of working set
   (cap=256 is sufficient for the fixture set), while with the cache off
   it ranges from seconds at small widths to minutes at 1000-field
   schemas. A realistic TPC-DS schema (store_sales, 23 columns) sees an
   11.9x decode speedup.
   
   ### Bench numbers (verbatim from `benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt`)
   
   ```
   decode tpcds-store_sales-23col:
     off (raw DataType.fromJson per call)     2207 ms     1.0X
     on  (intern.decodeStructType)             185 ms    11.9X
   
   C1 hit (256 schemas == cap):
     off                                       102 ms     1.0X
     on                                          4 ms    24.3X
   
   C3 churn (1024 schemas == 4x cap):
     off                                       407 ms     1.0X
     on                                         17 ms    24.0X
   ```
   
   Pre-existing partition-stats sections rerun in this commit show no
   regression (build 1.0x / 0.9x, high-sel 8.4x, low-sel 1.5x, point
   12.0x -- within run-to-run variance of the committed baseline).
   
   ### Risk
   
   - Wire format **unchanged** -- the write path still emits, and the
     read path still parses, length-prefixed UTF-8 JSON bytes. Existing
     on-disk caches remain readable.
   - Cache miss path is identical to current behavior (pure functions
     re-evaluate; no exception caching).
   - Heap retention bounded at 256 entries x schema-JSON size; ~MB
     worst case on pathological 1000-field schemas, KB-MB on realistic
     workloads.
   - No new SQLConf, no logger, no metric.
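
   To illustrate why interning leaves the wire format alone, here is a
   minimal sketch of length-prefixed UTF-8 framing. It assumes a 4-byte
   big-endian length prefix written with `DataOutputStream`; the
   serializer's actual frame layout may differ, and `SchemaFrameSketch`
   and its methods are hypothetical names.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.charset.StandardCharsets;

   /** Hypothetical framing helper; the 4-byte int prefix is an assumption. */
   public class SchemaFrameSketch {

     /** Writes [int length][UTF-8 bytes]. The bytes may come from a cache or be freshly encoded. */
     static byte[] frame(byte[] jsonUtf8) {
       try {
         ByteArrayOutputStream buffer = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(buffer);
         out.writeInt(jsonUtf8.length);
         out.write(jsonUtf8);
         out.flush();
         return buffer.toByteArray();
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }

     /** Reads one frame and returns the JSON text, which would be the decode-side cache key. */
     static String unframe(byte[] framed) {
       try {
         DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
         int length = in.readInt();
         byte[] payload = new byte[length];
         in.readFully(payload);
         return new String(payload, StandardCharsets.UTF_8);
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }

     public static void main(String[] args) {
       String json = "{\"type\":\"struct\",\"fields\":[]}";
       byte[] framed = frame(json.getBytes(StandardCharsets.UTF_8));
       System.out.println("frame bytes: " + framed.length);
       System.out.println("round trip ok: " + json.equals(unframe(framed)));
     }
   }
   ```

   Because the frame is built from the same bytes whether they come from
   a cache hit or a fresh encode, a reader cannot tell the two apart.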
   
   ### Test
   
   - New: `SchemaJsonInternCacheSuite` (6 tests, determinism /
     capacity / concurrency)
   - Existing: `ColumnarCachedBatchSerializerHelperSuite` (4 tests)
     passes against wired code, confirming Kryo round-trip + frame
     parsing + fall-back paths intact.
   

