yaooqinn opened a new issue, #12135:
URL: https://github.com/apache/gluten/issues/12135
### Background
#12132 added a schema-shape gate (`schemaHeuristic`) on top of
`ColumnarCachedBatchSerializer.validateSchema` so that wide-string
`InMemoryRelation`s fall back to Spark's `DefaultCachedBatchSerializer`
instead of paying the R2C + Arrow materialization tax (root cause: #3456).
The v1 gate is **schema-only**: it looks at the relation's schema and nothing
else. It applies uniformly to all four serializer entry points
(`supportsColumnarInput/Output`,
`convert{InternalRow,ColumnarBatch}ToCachedBatch`,
`convertCachedBatchTo{InternalRow,ColumnarBatch}`) to keep read/write
decisions consistent.
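
To make "schema-only" concrete, here is a minimal, Spark-free sketch. It does not reproduce the actual `schemaHeuristic` from #12132: the `Field` / `ColType` model, the `maxStringFraction` threshold, and the method bodies are hypothetical stand-ins. The only point is that every entry point consults the same pure function of the schema, so read-side and write-side decisions cannot disagree.

```scala
// Spark-free model of a schema-only gate. All names here are hypothetical.
object SchemaGateSketch {
  sealed trait ColType
  case object StringCol extends ColType
  case object LongCol extends ColType
  case object DoubleCol extends ColType

  final case class Field(name: String, tpe: ColType)

  // Assumed heuristic: a schema counts as "wide-string" when more than
  // maxStringFraction of its columns are strings. The real rule may differ.
  def isWideString(schema: Seq[Field], maxStringFraction: Double = 0.5): Boolean =
    if (schema.isEmpty) false
    else schema.count(_.tpe == StringCol).toDouble / schema.size > maxStringFraction

  // The single decision shared by every entry point.
  def useColumnarCache(schema: Seq[Field]): Boolean = !isWideString(schema)

  // Stand-ins for two of the serializer entry points: each derives its answer
  // from the schema alone, so write-side and read-side choices always agree.
  def supportsColumnarInput(schema: Seq[Field]): Boolean = useColumnarCache(schema)
  def supportsColumnarOutput(schema: Seq[Field]): Boolean = useColumnarCache(schema)
}
```

Because the schema does not change between planning and execution, nothing needs to be stored for the entry points to stay in agreement.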
### Problem
The v1 heuristic is over-conservative when the **child plan is already
columnar** (e.g. a Velox `Scan` feeding directly into `InMemoryRelation`).
In that case:
- There is no R2C tax on the write path — `convertColumnarBatchToCachedBatch`
just copies Velox batches into Arrow storage.
- The wide-string penalty still exists on the **read** path, but its magnitude
is workload-dependent and we've never measured the columnar-input variant
in isolation.
So today we may be sending some workloads through the row-based fallback that
would in fact win on the columnar path.
### Proposal (v2)
Add a child-plan / R2C-hazard aware variant of the gate. Concrete questions
this issue is here to answer:
1. Where to inspect the child plan. The serializer API only sees the schema;
the child plan is visible at `InMemoryRelation` construction time. A
plan-time rule (e.g. an `ApplyColumnarRulesAndInsertTransitions`-adjacent
strategy) seems like the right hook.
2. What signal to use:
- Is the cached child already a Gluten / Velox columnar node? (no R2C)
- Is there a `RowToColumnar` / `ColumnarToRow` injection on either side?
- Does the workload pattern (selectivity, projection) actually want
Velox's columnar read?
3. How to keep the four entry points consistent once the decision depends
on something other than the schema. v1 gets this for free because schema
is plan-stable; v2 needs the decision recorded somewhere
`InMemoryRelation` carries.
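
One possible shape for questions 1 to 3, as a Spark-free sketch rather than a proposed implementation. Every type below (`Plan`, `VeloxScan`, `CacheFormat`, `CachedRelation`, and so on) is a hypothetical model, not a Spark or Gluten class. The rule "prefer the columnar cache whenever the child is natively columnar" is an assumption that the benchmarks still have to confirm, and the workload-pattern signal is not modelled at all.

```scala
// Spark-free sketch of a plan-aware gate. Every type and name is hypothetical.
object PlanAwareGateSketch {
  sealed trait Plan { def children: Seq[Plan] }
  final case class VeloxScan(table: String) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class RowScan(table: String) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class RowToColumnar(child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  final case class ColumnarToRow(child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }

  sealed trait CacheFormat
  case object ColumnarCache extends CacheFormat
  case object RowFallback extends CacheFormat

  // The child is treated as natively columnar only when its root is a
  // columnar node. A RowToColumnar or ColumnarToRow root means a transition
  // was injected, so the R2C hazard is assumed to apply.
  def isNativeColumnar(plan: Plan): Boolean = plan match {
    case _: VeloxScan => true
    case _            => false
  }

  def decide(child: Plan, wideStringSchema: Boolean): CacheFormat =
    if (!wideStringSchema) ColumnarCache            // v1 already allows this
    else if (isNativeColumnar(child)) ColumnarCache // assumed: no R2C tax on write
    else RowFallback                                // same outcome as v1

  // The decision is computed once, when the relation is built, and stored on
  // the relation so that every entry point reads the same recorded value.
  final case class CachedRelation(child: Plan, wideStringSchema: Boolean) {
    val format: CacheFormat = decide(child, wideStringSchema)
    def supportsColumnarInput: Boolean = format == ColumnarCache
    def supportsColumnarOutput: Boolean = format == ColumnarCache
  }
}
```

Storing `format` on the relation is what replaces the "schema is plan-stable" guarantee from v1: the entry points never re-inspect the plan, they only read the recorded decision.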
### Benchmark gaps to fill before implementing
- W2-style wide-string workload **with a Velox columnar child** (the current
benchmark only covers row input).
- Measure the read-side string penalty independently of the write-side R2C
penalty. The +95s warmup tax in #12132 conflates both.
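
A minimal sketch of how a harness could report the two phases separately instead of a single warmup number. `PhaseTimingSketch`, `timed`, and `measure` are hypothetical helpers, not part of the existing benchmark; the caller is assumed to supply one function that materializes the cache and one that scans an already-materialized cache.

```scala
// Hypothetical timing helper: reports write-side and read-side cost separately.
object PhaseTimingSketch {
  final case class Timed[A](value: A, nanos: Long)

  // Runs the body once and records wall-clock time in nanoseconds.
  def timed[A](body: => A): Timed[A] = {
    val start = System.nanoTime()
    val result = body
    Timed(result, System.nanoTime() - start)
  }

  // buildCache stands in for the write path (materializing cached batches),
  // scanCache for the read path over the already-built cache.
  def measure[C, R](buildCache: () => C, scanCache: C => R): (Timed[C], Timed[R]) = {
    val write = timed(buildCache())
    val read = timed(scanCache(write.value))
    (write, read)
  }
}
```

A real benchmark would also repeat each phase and discard warmup iterations; this sketch only shows the separation of the two measurements.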
### Non-goals
- Not changing the schema-only v1 gate's defaults.
- Not flipping `spark.gluten.sql.columnar.tableCache` default to true; that
is tracked separately and gated on v1 (and probably v2) shipping in a
release first.
### References
- #3456 — original regression report
- #3488 — historical decision to keep the default off
- #12132 — v1 (this followup is the v2)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]