yaooqinn opened a new pull request, #12132:
URL: https://github.com/apache/gluten/pull/12132

   ## What changes were proposed in this pull request?
   
   `ColumnarCachedBatchSerializer` currently delegates to the row-based 
`DefaultCachedBatchSerializer` only when Velox cannot validate a schema at all. 
On wide-string / wide-row workloads the schema validates fine, but the 
row-to-columnar (R2C) conversion and Arrow materialization cost dominates, and 
the columnar cache loses to the row-based path. This PR adds a second, 
session-overridable gate so each `InMemoryRelation` is routed independently 
based on schema shape.
   
   Two new configs:
   
   | Conf | Default | Meaning |
   | --- | --- | --- |
   | `spark.gluten.sql.columnar.tableCache.maxStringFraction` | `0.5` | Max string/binary column ratio that still takes the columnar path |
   | `spark.gluten.sql.columnar.tableCache.maxAvgRowBytes` | `1024` | Upper bound on `sum(dataType.defaultSize)` |
   
   Schemas exceeding either bound fall back to the row-based serializer. All 
four call sites (`supportsColumnarInput`, `supportsColumnarOutput`, 
`convertInternalRowToCachedBatch`, 
`convertCachedBatchTo{InternalRow,ColumnarBatch}`) are kept in sync so the read 
side never disagrees with the write side.
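   
   The routing rule described above can be modeled in a few lines. This is a 
minimal, dependency-free sketch in Python, not the PR's Scala code: the 
`Field` class, the `use_columnar_cache` helper, and the empty-schema behavior 
are hypothetical, and the per-type sizes are assumed to match what Spark's 
`DataType.defaultSize` reports. Only the two config keys, their defaults, and 
the "exceeding either bound falls back" rule come from this description.

```python
from dataclasses import dataclass

# Assumed per-type estimates, intended to mirror Spark's DataType.defaultSize
# (string = 20, binary = 100, long/double = 8, int = 4).
DEFAULT_SIZE = {"string": 20, "binary": 100, "long": 8, "double": 8, "int": 4}

# Config keys and defaults taken from the table in this PR description.
MAX_STRING_FRACTION = "spark.gluten.sql.columnar.tableCache.maxStringFraction"
MAX_AVG_ROW_BYTES = "spark.gluten.sql.columnar.tableCache.maxAvgRowBytes"
DEFAULTS = {MAX_STRING_FRACTION: 0.5, MAX_AVG_ROW_BYTES: 1024}


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    data_type: str  # one of the DEFAULT_SIZE keys


def use_columnar_cache(schema, session_conf=None):
    """Return True for the columnar path, False for the row-based fallback."""
    # Session-level settings override the defaults.
    conf = {**DEFAULTS, **(session_conf or {})}
    if not schema:
        return True  # assumption: the PR text does not cover empty schemas
    stringish = sum(1 for f in schema if f.data_type in ("string", "binary"))
    string_fraction = stringish / len(schema)
    est_row_bytes = sum(DEFAULT_SIZE[f.data_type] for f in schema)
    # Both bounds are inclusive: only schemas exceeding a bound fall back.
    return (string_fraction <= conf[MAX_STRING_FRACTION]
            and est_row_bytes <= conf[MAX_AVG_ROW_BYTES])


numeric = [Field(f"c{i}", "long") for i in range(16)]
wide_string = [Field(f"s{i}", "string") for i in range(16)]
half_and_half = numeric[:8] + wide_string[:8]
wide_numeric = [Field(f"c{i}", "long") for i in range(200)]

print(use_columnar_cache(numeric))        # all-numeric schema
print(use_columnar_cache(wide_string))    # all-string schema, default settings
print(use_columnar_cache(wide_string, {MAX_STRING_FRACTION: 1.0}))  # override
print(use_columnar_cache(half_and_half))  # exactly at the 0.5 boundary
print(use_columnar_cache(wide_numeric))   # 200 longs exceed the byte bound
```

   Routing every call site through one such predicate is one way to keep the 
read and write sides in agreement, since the decision depends only on the 
schema and the session configuration.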
   
   ### Why
   
   The columnar table cache is off by default 
(`spark.gluten.sql.columnar.tableCache=false`) precisely because of regressions 
like #3456. Locally collected numbers on Spark 3.5.6 + Velox nightly, 16 GB 
string cache, single-node WSL Ubuntu 24.04:
   
   | Workload | vanilla | gluten-off | gluten-on (warmup) | gluten-on (steady) |
   | --- | --- | --- | --- | --- |
   | W1 numeric (16 longs) | 8853 ms | 8853 ms | ~ | ~ (1.62× faster than vanilla) |
   | W2 wide-string (16× ~200 chars) | 45245 ms | 53883 ms | **148574 ms** | **80016 ms** |
   
   W2 confirms the #3456 hazard exists: the columnar cache is 1.49× slower than 
the row-based path (gluten-off) in steady state (80016 ms vs. 53883 ms) and 
pays an extra 95 s on warmup. W1 confirms the columnar path is a real win on 
numeric schemas. A single global default cannot satisfy both, hence the 
per-relation gate.
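   
   The W2 figures quoted above follow directly from the table. A quick check, 
using Python only as a calculator and taking gluten-off as the row-based 
baseline (the variable names are illustrative):

```python
# W2 timings in milliseconds, copied from the benchmark table.
gluten_off_ms = 53883
gluten_on_warmup_ms = 148574
gluten_on_steady_ms = 80016

# Steady-state slowdown of the columnar cache relative to the row-based path.
steady_slowdown = gluten_on_steady_ms / gluten_off_ms

# Additional warmup cost of the columnar cache, in seconds.
warmup_penalty_s = (gluten_on_warmup_ms - gluten_off_ms) / 1000

print(f"steady-state slowdown: {steady_slowdown:.3f}x")  # about 1.485x
print(f"extra warmup cost: {warmup_penalty_s:.1f} s")     # about 94.7 s
```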
   
   This PR does **not** flip the `spark.gluten.sql.columnar.tableCache` 
default. The gate just makes a future default flip safer.
   
   The gate is intentionally schema-only in v1. Child-plan / R2C-hazard 
heuristics (e.g. cache-of-cache, post-shuffle exchange) are left as a follow-up.
   
   ## How was this patch tested?
   
   New suite `ColumnarCachedBatchSchemaHeuristicSuite` covers:
   
   - numeric-only schema → columnar path
   - wide-string schema → row fallback (default settings)
   - session override re-enables columnar path
   - 50/50 string-vs-long schema at the inclusive boundary
   - avg-row-bytes gate alone rejects a wide numeric schema
   - end-to-end wide-string cache + filter roundtrip via the row fallback 
returns correct rows
   
   Local `mvn install` clean on `gluten-substrait` and `backends-velox`. The 
full scalatest suite requires a built native `libgluten.so` and will run in CI.
   
   Generated-by: claude-opus-4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

