beliefer opened a new issue, #12232:
URL: https://github.com/apache/gluten/issues/12232

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   When a Hive ORC table is read by position rather than by column name — 
either because the user sets the Hadoop config 
`orc.force.positional.evolution=true`, or because the ORC files were written 
with positional field names (`_col0`, `_col1`, ...) as legacy Hive does — 
Gluten/Velox returns **empty or wrong results**, while vanilla Spark returns 
the correct data for the exact same query.
   
   Gluten's Velox path maps table fields to file fields **by name** 
(`spark.gluten.sql.columnar.backend.velox.orcUseColumnNames`, default `true`) 
and has **no awareness** of `orc.force.positional.evolution`. When the ORC 
file's physical column names do not match the metastore schema (the very 
situation `orc.force.positional.evolution=true` exists to handle), name-based 
matching fails, the projected columns read back as null/empty, the downstream 
filter drops every row, and AQE collapses the final plan into `LocalTableScan` 
with 0 rows.
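
The failure mode described above can be illustrated with a small, self-contained sketch. This is a simplified model, not Gluten or Velox code: `OrcColumnResolution`, `resolveByName`, and `resolveByPosition` are hypothetical names, and exact (case-sensitive) name matching is assumed for brevity.

```scala
// Hypothetical, simplified model of ORC column resolution (not Gluten/Velox code).
object OrcColumnResolution {

  /** Resolve each table column to a file column index by exact name match.
    * A column with no match resolves to None, which a reader would surface as nulls. */
  def resolveByName(tableCols: Seq[String], fileCols: Seq[String]): Seq[Option[Int]] =
    tableCols.map { name =>
      val idx = fileCols.indexOf(name)
      if (idx >= 0) Some(idx) else None
    }

  /** Resolve each table column to the file column at the same ordinal.
    * Ordinals beyond the end of the file schema resolve to None. */
  def resolveByPosition(tableCols: Seq[String], fileCols: Seq[String]): Seq[Option[Int]] =
    tableCols.indices.map(i => if (i < fileCols.length) Some(i) else None)
}
```

With table columns `col_a`, `col_b` and file columns `_col0`, `_col1`, `resolveByName` finds no match for either column, while `resolveByPosition` maps them to ordinals 0 and 1.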
   
   Vanilla Spark handles this correctly in `OrcUtils.requestedColumnIds`:
   
   ```scala
   // spark/sql/core/.../orc/OrcUtils.scala
   val forcePositionalEvolution = OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
   ...
   if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
     // map columns by position
   ```
   
   Gluten should mirror this: when `orc.force.positional.evolution=true`, it 
must read ORC by position (i.e. behave as if `orcUseColumnNames=false`).
   
   ### Spark version
   
   Spark-3.x (reproduced on 3.5)
   
   ### Spark configurations
   
   ```
   spark.plugins=org.apache.gluten.GlutenPlugin
   spark.sql.hive.convertMetastoreOrc=false
   spark.hadoop.orc.force.positional.evolution=true
   ```
   
   ### Reproduction
   
   A Hive ORC table whose underlying files use positional/`_col*` field names 
(or any ORC table where the file schema column names differ from the metastore, 
with `orc.force.positional.evolution=true` set):
   
   ```sql
   SELECT col_a, col_b FROM some_orc_table WHERE ds = '20260531';
   ```
   
   - **Vanilla Spark** (no Gluten plugin): returns the expected rows.
   - **Gluten/Velox** (default `orcUseColumnNames=true`): returns **0 rows** — 
the scan reads the projected columns as null, the `WHERE` filter removes 
everything, AQE folds the plan to `LocalTableScan (rows=0)`.
   
   ### Workaround (confirmed)
   
   ```sql
   SET spark.gluten.sql.columnar.backend.velox.orcUseColumnNames=false;
   ```
   
   This makes Velox map ORC columns by position, matching vanilla Spark and 
`orc.force.positional.evolution=true`. Confirmed to return correct results.
   
   ### Proposed fix
   
   Make the **effective** `orcUseColumnNames` value account for 
`orc.force.positional.evolution`:
   
   ```
   effectiveOrcUseColumnNames = orcUseColumnNames && !orc.force.positional.evolution
   ```
   
   This effective value must be applied on **both** ends that consume it:
   
   1. JVM — `VeloxIteratorApi.setFileSchemaForLocalFiles` (decides whether the 
table/data schema is passed to native so the reader can do positional mapping).
   2. Native — the `spark.gluten.sql.columnar.backend.velox.orcUseColumnNames` 
value forwarded to `ConfigExtractor.cc` → Velox `kOrcUseColumnNamesSession`.
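
A minimal sketch of the effective-value computation follows. It assumes the merged configuration is available as a plain `Map[String, String]`. The two config keys and their defaults (`true` for `orcUseColumnNames`, `false` for `orc.force.positional.evolution`) come from this issue and from ORC. `EffectiveOrcConfig` and its helpers are hypothetical names, not existing Gluten APIs. In a real Spark session the Hadoop key may appear with the `spark.hadoop.` prefix, which this sketch does not handle.

```scala
// Hypothetical helper illustrating the proposed rule; not existing Gluten code.
object EffectiveOrcConfig {
  val OrcUseColumnNamesKey = "spark.gluten.sql.columnar.backend.velox.orcUseColumnNames"
  val ForcePositionalKey = "orc.force.positional.evolution"

  // Parse a boolean setting, falling back to the default when the key is absent.
  private def getBoolean(conf: Map[String, String], key: String, default: Boolean): Boolean =
    conf.get(key).map(_.trim.toLowerCase == "true").getOrElse(default)

  /** Name-based mapping stays enabled only when positional evolution is not forced. */
  def effectiveOrcUseColumnNames(conf: Map[String, String]): Boolean = {
    val useNames = getBoolean(conf, OrcUseColumnNamesKey, default = true)
    val forcePositional = getBoolean(conf, ForcePositionalKey, default = false)
    useNames && !forcePositional
  }
}
```

Computing the value once and passing the same result to both the JVM side and the native side would keep the two ends from disagreeing about the mapping mode.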
   
   Doing this automatically means users who set the standard Spark/Hadoop 
config `orc.force.positional.evolution=true` get correct ORC results under 
Gluten without having to also discover and set the Gluten-specific 
`orcUseColumnNames=false` switch — and the two configs no longer silently 
contradict each other.
   
   ### Gluten version
   
   main branch
   
   ### Spark version
   
   Spark-3.5.x
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```

