beliefer opened a new issue, #12232:
URL: https://github.com/apache/gluten/issues/12232
### Backend
VL (Velox)
### Bug description
When a Hive ORC table is read by position rather than by column name —
either because the user sets the Hadoop config
`orc.force.positional.evolution=true`, or because the ORC files were written
with positional field names (`_col0`, `_col1`, ...) as legacy Hive does —
Gluten/Velox returns **empty or wrong results**, while vanilla Spark returns
the correct data for the exact same query.
Gluten's Velox path maps table fields to file fields **by name**
(`spark.gluten.sql.columnar.backend.velox.orcUseColumnNames`, default `true`)
and has **no awareness** of `orc.force.positional.evolution`. When the ORC
file's physical column names do not match the metastore schema (the very
situation `orc.force.positional.evolution=true` exists to handle), name-based
matching fails, the projected columns read back as null/empty, the downstream
filter drops every row, and AQE collapses the final plan into `LocalTableScan`
with 0 rows.
Vanilla Spark handles this correctly in `OrcUtils.requestedColumnIds`:
```scala
// spark/sql/core/.../orc/OrcUtils.scala
val forcePositionalEvolution =
  OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)
...
if (forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))) {
  // map columns by position
```
Gluten should mirror this: when `orc.force.positional.evolution=true`, it
must read ORC by position (i.e. behave as if `orcUseColumnNames=false`).
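To make the difference between the two mapping modes concrete, here is a minimal, self-contained sketch of the decision quoted above. It is not Spark's or Gluten's actual code: `OrcColumnMapping`, `usePositional`, and `resolveColumnIds` are hypothetical names, schemas are modeled as plain `Seq[String]`, and case sensitivity, nested types, and the empty-schema special case are ignored.

```scala
object OrcColumnMapping {

  /** Same condition as the quoted snippet: map by position when the flag is
    * set or when every physical field name is a `_col*` placeholder.
    * Note: `forall` is true for an empty list; this sketch does not treat
    * empty file schemas specially. */
  def usePositional(
      forcePositionalEvolution: Boolean,
      orcFieldNames: Seq[String]): Boolean =
    forcePositionalEvolution || orcFieldNames.forall(_.startsWith("_col"))

  /** For each required column, return its index in the ORC file,
    * or -1 when the column cannot be resolved (it would read back as null). */
  def resolveColumnIds(
      forcePositionalEvolution: Boolean,
      orcFieldNames: Seq[String],
      tableFieldNames: Seq[String],
      requiredFieldNames: Seq[String]): Seq[Int] = {
    if (usePositional(forcePositionalEvolution, orcFieldNames)) {
      // Positional: the i-th table column is the i-th file column.
      requiredFieldNames.map { name =>
        val idx = tableFieldNames.indexOf(name)
        if (idx >= 0 && idx < orcFieldNames.length) idx else -1
      }
    } else {
      // Name-based: look the required name up in the file schema.
      requiredFieldNames.map(name => orcFieldNames.indexOf(name))
    }
  }
}
```

With a table schema of `Seq("col_a", "col_b", "ds")` and a file whose fields are named `Seq("a", "b", "d")`, requesting `col_a` and `col_b` without the flag resolves to `Seq(-1, -1)` in this sketch, which corresponds to the all-null scan described in this issue. Passing `forcePositionalEvolution = true` resolves the same request to `Seq(0, 1)`.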
### Spark version
Spark-3.x (reproduced on 3.5)
### Spark configurations
```
spark.plugins=org.apache.gluten.GlutenPlugin
spark.sql.hive.convertMetastoreOrc=false
spark.hadoop.orc.force.positional.evolution=true
```
### Reproduction
A Hive ORC table whose underlying files use positional/`_col*` field names
(or any ORC table where the file schema column names differ from the metastore,
with `orc.force.positional.evolution=true` set):
```sql
SELECT col_a, col_b FROM some_orc_table WHERE ds = '20260531';
```
- **Vanilla Spark** (no Gluten plugin): returns the expected rows.
- **Gluten/Velox** (default `orcUseColumnNames=true`): returns **0 rows**.
  The scan reads the projected columns as null, the `WHERE` filter removes
  every row, and AQE folds the plan to `LocalTableScan (rows=0)`.
### Workaround (confirmed)
```sql
SET spark.gluten.sql.columnar.backend.velox.orcUseColumnNames=false;
```
This makes Velox map ORC columns by position, matching vanilla Spark and
`orc.force.positional.evolution=true`. Confirmed to return correct results.
### Proposed fix
Make the **effective** `orcUseColumnNames` value account for
`orc.force.positional.evolution`:
```
effectiveOrcUseColumnNames = orcUseColumnNames &&
!orc.force.positional.evolution
```
This effective value must be applied on **both** ends that consume it:
1. JVM: `VeloxIteratorApi.setFileSchemaForLocalFiles` (decides whether the
   table/data schema is passed to native so the reader can do positional mapping).
2. Native: the `spark.gluten.sql.columnar.backend.velox.orcUseColumnNames`
   value forwarded to `ConfigExtractor.cc` → Velox `kOrcUseColumnNamesSession`.
Applying this automatically means that users who set the standard
Spark/Hadoop config `orc.force.positional.evolution=true` get correct ORC
results under Gluten without also having to discover and set the
Gluten-specific `orcUseColumnNames=false` switch. It also ensures the two
configs no longer silently contradict each other.
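The effective-value rule proposed here can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `EffectiveOrcConfig` and its helper are hypothetical, the configuration is modeled as a plain `Map[String, String]` rather than Gluten's or Hadoop's config classes, the flag is looked up both with and without the `spark.hadoop.` prefix, and an unset `orc.force.positional.evolution` is assumed to mean `false`.

```scala
object EffectiveOrcConfig {
  val GlutenOrcUseColumnNames =
    "spark.gluten.sql.columnar.backend.velox.orcUseColumnNames"
  val OrcForcePositional = "orc.force.positional.evolution"
  val SparkHadoopPrefix = "spark.hadoop."

  // Hypothetical helper: parse a boolean config value, falling back to a default.
  private def bool(
      conf: Map[String, String],
      key: String,
      default: Boolean): Boolean =
    conf.get(key).map(_.trim.equalsIgnoreCase("true")).getOrElse(default)

  /** orcUseColumnNames && !orc.force.positional.evolution */
  def effectiveOrcUseColumnNames(conf: Map[String, String]): Boolean = {
    // Default `true` for the Gluten switch, as stated in this issue.
    val useNames = bool(conf, GlutenOrcUseColumnNames, default = true)
    // Assumption: the flag may arrive with or without the Spark prefix.
    val forcePositional =
      bool(conf, OrcForcePositional, default = false) ||
        bool(conf, SparkHadoopPrefix + OrcForcePositional, default = false)
    useNames && !forcePositional
  }
}
```

Under this sketch, the configuration from the reproduction (`spark.hadoop.orc.force.positional.evolution=true`, Gluten switch left at its default) yields `false`, so both the JVM and native sides would read by position. Computing the value once and passing the same result to both consumers is what keeps them from disagreeing.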
### Gluten version
main branch