yaooqinn opened a new pull request, #12132:
URL: https://github.com/apache/gluten/pull/12132
## What changes were proposed in this pull request?
`ColumnarCachedBatchSerializer` currently delegates to the row-based
`DefaultCachedBatchSerializer` only when Velox cannot validate a schema at all.
On wide-string / wide-row workloads the schema validates fine, but the cost of
row-to-columnar (R2C) conversion plus Arrow materialization dominates, and the
columnar cache loses to the row-based path. This PR adds a second,
session-overridable gate so each `InMemoryRelation` is routed independently
based on schema shape.
Two new configs:
| Conf | Default | Meaning |
| --- | --- | --- |
| `spark.gluten.sql.columnar.tableCache.maxStringFraction` | `0.5` | Max string/binary column ratio that still takes the columnar path |
| `spark.gluten.sql.columnar.tableCache.maxAvgRowBytes` | `1024` | Upper bound on `sum(dataType.defaultSize)` |
Schemas exceeding either bound fall back to the row-based serializer. All
four call sites (`supportsColumnarInput`, `supportsColumnarOutput`,
`convertInternalRowToCachedBatch`,
`convertCachedBatchTo{InternalRow,ColumnarBatch}`) are kept in sync so the read
side never disagrees with the write side.
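
A minimal sketch of the schema-shape gate described above, written in Python purely for illustration (the PR itself changes Gluten's Scala code). The names `takes_columnar_path`, `Column`, and `DEFAULT_SIZE` are hypothetical. The size table is assumed to mirror Spark's `DataType.defaultSize` for a few common types, and the comparisons are assumed to be inclusive, since only schemas *exceeding* a bound fall back.

```python
from typing import NamedTuple

# Assumption: mirrors Spark's DataType.defaultSize for a few common types.
DEFAULT_SIZE = {"long": 8, "int": 4, "double": 8, "string": 20, "binary": 100}


class Column(NamedTuple):
    name: str
    dtype: str


def takes_columnar_path(schema, max_string_fraction=0.5, max_avg_row_bytes=1024):
    """Return True if the schema stays on the columnar cache path."""
    if not schema:
        return True  # assumption: an empty schema has nothing to penalize
    stringish = sum(1 for c in schema if c.dtype in ("string", "binary"))
    string_fraction = stringish / len(schema)
    est_row_bytes = sum(DEFAULT_SIZE[c.dtype] for c in schema)
    # A schema falls back to the row-based serializer if it exceeds either bound.
    return (string_fraction <= max_string_fraction
            and est_row_bytes <= max_avg_row_bytes)


numeric = [Column(f"c{i}", "long") for i in range(16)]
wide_string = [Column(f"s{i}", "string") for i in range(16)]

print(takes_columnar_path(numeric))
print(takes_columnar_path(wide_string))
```

Because the decision depends only on the schema and the two thresholds, evaluating the same predicate at every call site would be one way to keep the write and read sides consistent.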
### Why
The columnar table cache is disabled by default
(`spark.gluten.sql.columnar.tableCache=false`) precisely because of regressions
like #3456. Locally collected numbers on Spark 3.5.6 + Velox nightly, 16 GB
string cache, single-node WSL Ubuntu 24.04:
| Workload | vanilla | gluten-off | gluten-on (warmup) | gluten-on (steady) |
| --- | --- | --- | --- | --- |
| W1 numeric (16 longs) | 8853 ms | 8853 ms | ~ | ~ (1.62× faster than vanilla) |
| W2 wide-string (16× ~200 chars) | 45245 ms | 53883 ms | **148574 ms** | **80016 ms** |
W2 confirms the #3456 hazard exists: in steady state the columnar cache is
about 1.48× slower than the gluten-off row-based path (80016 ms vs. 53883 ms)
and costs roughly 95 s more on warmup. W1 confirms the columnar path is a real
win on numeric schemas. A single global default cannot satisfy both, hence the
per-relation gate.
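
The W2 comparison can be re-derived from the table's numbers. A small Python sketch (variable names are illustrative) that computes the steady-state slowdown relative to the gluten-off column and the extra warmup time:

```python
# Timings in milliseconds, copied from the W2 row of the table above.
gluten_off_ms = 53883
gluten_on_warmup_ms = 148574
gluten_on_steady_ms = 80016

# Steady-state slowdown of the columnar cache relative to gluten-off.
steady_slowdown = gluten_on_steady_ms / gluten_off_ms

# Additional time spent on the warmup run, in seconds.
warmup_extra_s = (gluten_on_warmup_ms - gluten_off_ms) / 1000

print(f"steady-state slowdown: {steady_slowdown:.3f}x")
print(f"extra warmup cost: {warmup_extra_s:.1f} s")
```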
This PR does **not** flip the `spark.gluten.sql.columnar.tableCache`
default. The gate just makes a future default flip safer.
The gate is intentionally schema-only in v1. Child-plan / R2C-hazard
heuristics (e.g. cache-of-cache, post-shuffle exchange) are left as a follow-up.
## How was this patch tested?
New suite `ColumnarCachedBatchSchemaHeuristicSuite` covers:
- numeric-only schema → columnar path
- wide-string schema → row fallback (default settings)
- session override re-enables columnar path
- 50/50 string-vs-long schema at the inclusive boundary (does not exceed `maxStringFraction`)
- avg-row-bytes gate alone rejects a wide numeric schema
- end-to-end wide-string cache + filter roundtrip via the row fallback
returns correct rows
Local `mvn install` clean on `gluten-substrait` and `backends-velox`. The
full scalatest suite requires a built native `libgluten.so` and will run in CI.
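
The routing cases listed above can be pictured as a table-driven check. This is a hypothetical Python model, not the suite's actual code: a schema is reduced to `(string columns, long columns)`, the per-column sizes of 20 and 8 bytes are assumed to match Spark's `defaultSize` for `StringType` and `LongType`, `is_columnar` is an illustrative name, and the 50/50 case is assumed to stay columnar because it does not exceed the bound.

```python
CONF_FRACTION = "spark.gluten.sql.columnar.tableCache.maxStringFraction"
CONF_ROW_BYTES = "spark.gluten.sql.columnar.tableCache.maxAvgRowBytes"
DEFAULTS = {CONF_FRACTION: 0.5, CONF_ROW_BYTES: 1024}


def is_columnar(n_string, n_long, session_conf=None):
    """Hypothetical gate over column counts; session values override defaults."""
    conf = {**DEFAULTS, **(session_conf or {})}
    fraction = n_string / (n_string + n_long)
    row_bytes = 20 * n_string + 8 * n_long  # assumed default sizes
    return fraction <= conf[CONF_FRACTION] and row_bytes <= conf[CONF_ROW_BYTES]


# (description, (string cols, long cols), session override, expect columnar?)
cases = [
    ("numeric-only", (0, 16), None, True),
    ("wide-string, default settings", (16, 0), None, False),
    ("wide-string, session override", (16, 0), {CONF_FRACTION: 1.0}, True),
    ("50/50 at the inclusive boundary", (8, 8), None, True),
    ("wide numeric, row-bytes gate alone", (0, 200), None, False),
]

for name, (n_str, n_long), conf, expected in cases:
    assert is_columnar(n_str, n_long, conf) == expected, name
```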
Generated-by: claude-opus-4.7