yaooqinn opened a new issue, #12135:
URL: https://github.com/apache/gluten/issues/12135

   ### Background
   
   #12132 added a schema-shape gate (`schemaHeuristic`) on top of
   `ColumnarCachedBatchSerializer.validateSchema` so that wide-string
   `InMemoryRelation`s fall back to Spark's `DefaultCachedBatchSerializer`
   instead of paying the R2C + Arrow materialization tax (root cause: #3456).
   
   The v1 gate is **schema-only**: it looks at the relation's schema and nothing
   else. It applies uniformly to all four serializer entry points
   (`supportsColumnarInput/Output`,
   `convert{InternalRow,ColumnarBatch}ToCachedBatch`,
   `convertCachedBatchTo{InternalRow,ColumnarBatch}`) to keep read/write
   decisions consistent.
   
   ### Problem
   
   The v1 heuristic is over-conservative when the **child plan is already
   columnar** (e.g. a Velox `Scan` feeding directly into `InMemoryRelation`).
   In that case:
   
   - There is no R2C tax on the write path — `convertColumnarBatchToCachedBatch`
     just copies Velox batches into Arrow storage.
   - The wide-string penalty still exists on the **read** path, but its
     magnitude is workload-dependent and we've never measured the
     columnar-input variant in isolation.
   
   So today we may be sending some workloads through the row-based fallback that
   would in fact win on the columnar path.
   
   ### Proposal (v2)
   
   Add a child-plan / R2C-hazard aware variant of the gate. Concrete questions
   this issue is here to answer:
   
   1. Where to inspect the child plan. The serializer API only sees the schema;
      the child plan is visible at `InMemoryRelation` construction time. A
      plan-time rule (e.g. an `ApplyColumnarRulesAndInsertTransitions`-adjacent
      strategy) seems like the right hook.
   2. What signal to use:
      - Is the cached child already a Gluten / Velox columnar node? (no R2C)
      - Is there a `RowToColumnar` / `ColumnarToRow` injection on either side?
      - Does the workload pattern (selectivity, projection) actually want
        Velox's columnar read?
   3. How to keep the four entry points consistent once the decision depends
      on something other than the schema. v1 gets this for free because schema
      is plan-stable; v2 needs the decision recorded somewhere
      `InMemoryRelation` carries.
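
To make questions 1-3 concrete, here is a minimal sketch of what a child-plan aware gate could look like. It is not the Gluten implementation: `PlanNode`, `ColumnarScan`, `RowToColumnarNode`, `CacheGate`, and `CachedRelation` are hypothetical toy types standing in for the real plan nodes and `InMemoryRelation`, and the policy in `decide` (take the columnar path for a wide-string schema when the child is columnar with no R2C) is a placeholder that the benchmarks below would need to confirm or reject.

```scala
// Toy stand-ins for physical plan nodes; NOT the real Spark or Gluten classes.
sealed trait PlanNode {
  def children: Seq[PlanNode]
  def isColumnar: Boolean
}
final case class ColumnarScan(table: String) extends PlanNode {
  def children: Seq[PlanNode] = Nil
  def isColumnar: Boolean = true
}
final case class RowScan(table: String) extends PlanNode {
  def children: Seq[PlanNode] = Nil
  def isColumnar: Boolean = false
}
final case class RowToColumnarNode(child: PlanNode) extends PlanNode {
  def children: Seq[PlanNode] = Seq(child)
  def isColumnar: Boolean = true
}
final case class ColumnarFilter(child: PlanNode) extends PlanNode {
  def children: Seq[PlanNode] = Seq(child)
  def isColumnar: Boolean = true
}

sealed trait CacheFormat
case object ColumnarCache extends CacheFormat
case object RowFallback extends CacheFormat

final case class CacheDecision(format: CacheFormat, reason: String)

object CacheGate {
  // True if any node in the subtree is a row-to-columnar transition.
  def containsR2C(plan: PlanNode): Boolean = plan match {
    case _: RowToColumnarNode => true
    case other                => other.children.exists(containsR2C)
  }

  // Placeholder policy (an assumption, pending the columnar-input benchmark).
  def decide(child: PlanNode, wideStringSchema: Boolean): CacheDecision =
    if (!wideStringSchema)
      CacheDecision(ColumnarCache, "schema passes the v1 gate")
    else if (child.isColumnar && !containsR2C(child))
      CacheDecision(ColumnarCache, "columnar child, no R2C on the write path")
    else
      CacheDecision(RowFallback, "wide strings with row input or an R2C transition")
}

// The decision is computed once at construction and stored on the relation,
// so every entry point reads the same value.
final case class CachedRelation(child: PlanNode, wideStringSchema: Boolean) {
  val decision: CacheDecision = CacheGate.decide(child, wideStringSchema)
  def supportsColumnarInput: Boolean = decision.format == ColumnarCache
  def supportsColumnarOutput: Boolean = decision.format == ColumnarCache
}
```

In this sketch, `CachedRelation(ColumnarFilter(ColumnarScan("t")), wideStringSchema = true)` takes the columnar path, while the same schema over `RowToColumnarNode(RowScan("t"))` falls back to rows. Storing `decision` on the relation is one way to address question 3; where the real equivalent would live is still open.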
   
   ### Benchmark gaps to fill before implementing
   
   - W2-style wide-string workload **with a Velox columnar child** (the current
     benchmark only covers row input).
   - Measure the read-side string penalty independently of the write-side R2C
     penalty. The +95s warmup tax in #12132 conflates both.
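
One possible shape for a harness that reports the two phases separately is sketched below. `PhaseTimings` and `CacheBench` are hypothetical names, and the two callbacks are assumptions about how the benchmark would be wired: `materialize` would be the first action on the cached relation (which populates the cache) and `read` a later action served from it. Note that the first phase still includes executing the child plan itself, so it is an upper bound on the write-side cost rather than a pure measurement of it.

```scala
// Timings for one benchmark run: a single cache-populating pass,
// followed by several passes that read from the populated cache.
final case class PhaseTimings(writeNanos: Long, readNanos: Seq[Long]) {
  def meanReadNanos: Double =
    if (readNanos.isEmpty) 0.0 else readNanos.sum.toDouble / readNanos.size
}

object CacheBench {
  // Wall-clock time of one evaluation of `body`, in nanoseconds.
  private def timeNanos(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    System.nanoTime() - start
  }

  // Runs `materialize` once, then `read` exactly `readRuns` times,
  // timing each invocation separately.
  def run(materialize: () => Unit, read: () => Unit, readRuns: Int): PhaseTimings = {
    val write = timeNanos(materialize())
    val reads = Seq.fill(readRuns)(timeNanos(read()))
    PhaseTimings(write, reads)
  }
}
```

Running this once with a row-input child and once with a Velox columnar child, on the same schema, would give the write-side and read-side numbers that the +95s figure currently lumps together.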
   
   ### Non-goals
   
   - Not changing the schema-only v1 gate's defaults.
   - Not flipping `spark.gluten.sql.columnar.tableCache` default to true; that
     is tracked separately and gated on v1 (and probably v2) shipping in a
     release first.
   
   ### References
   
   - #3456 — original regression report
   - #3488 — historical decision to keep the default off
   - #12132 — v1 (this followup is the v2)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

