yaooqinn opened a new pull request, #12112:
URL: https://github.com/apache/gluten/pull/12112
### What changes were proposed in this pull request?
Skip min/max stats for non-binary-collation `StringType` columns in the
Velox cache path, and write a permissive sentinel bound on the deserialize side
as a fallback for any column whose `supported` flag is 0.
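The fallback described above can be pictured with a minimal sketch. Everything here is an assumption for illustration: `ColumnStats`, `effectiveStats`, and `mightContain` are hypothetical names, and the unbounded range is modelled as `None`/`None` rather than whatever sentinel value the patch actually writes. Only the `supported` flag comes from this PR.

```scala
object SentinelSketch {
  // Hypothetical per-column stats record; `supported` mirrors the flag named above.
  final case class ColumnStats(supported: Int, min: Option[String], max: Option[String])

  // On deserialize, a column with supported == 0 gets an unbounded range
  // (modelled here as None/None) instead of any stored min/max.
  def effectiveStats(raw: ColumnStats): ColumnStats =
    if (raw.supported == 0) raw.copy(min = None, max = None) else raw

  // Equality pruning check: a batch may be skipped only when this returns false.
  // String.compareTo is used as a stand-in for byte order (they agree for ASCII).
  def mightContain(stats: ColumnStats, value: String): Boolean = {
    val aboveMin = stats.min.forall(_.compareTo(value) <= 0)
    val belowMax = stats.max.forall(_.compareTo(value) >= 0)
    aboveMin && belowMax
  }
}
```

With an unbounded range, `mightContain` is always true, so a column without trustworthy stats can never cause a batch to be skipped.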
Add a new shim API, `SparkShims.isBinaryCollationString`. It defaults to `true`
on the Spark 3.x shims, which have no collation concept, and is overridden on
Spark 4.0 / 4.1 to check `collationId == UTF8_BINARY_COLLATION_ID`.
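As a rough illustration of the default-plus-override split, here is a self-contained sketch. The types `StringCol`, `Shims`, `Shims3x`, and `Shims4x` are hypothetical stand-ins rather than Gluten's or Spark's classes, and the id value `0` for the binary collation is assumed for the example. Only the method name `isBinaryCollationString` is taken from this PR.

```scala
// Hypothetical stand-in for a string column type carrying a collation id.
final case class StringCol(collationId: Int)

object ShimSketch {
  // Assumed id for the binary collation, for illustration only.
  val Utf8BinaryCollationId: Int = 0

  trait Shims {
    // Default: every string column is treated as binary-collated.
    def isBinaryCollationString(s: StringCol): Boolean = true
  }

  // Spark 3.x has no collation concept, so the default is kept.
  object Shims3x extends Shims

  // Spark 4.x compares the column's collation id against the binary one.
  object Shims4x extends Shims {
    override def isBinaryCollationString(s: StringCol): Boolean =
      s.collationId == Utf8BinaryCollationId
  }

  // Min/max stats are collected only when byte-order comparison is safe.
  def shouldCollectMinMax(shims: Shims, s: StringCol): Boolean =
    shims.isBinaryCollationString(s)
}
```

Keeping the default at `true` means the 3.x shims need no override and their existing stats behaviour is unchanged.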
### Why are the changes needed?
On Spark 4.x with a non-binary collation, Velox's `scanMinMax<StringView>`
does an unsigned byte-order compare while Spark's filter compare is
collation-aware. The two disagree, so stats-based pruning can silently drop
matching rows.
Repro:
```scala
spark.sql("CREATE TABLE t(s STRING COLLATE UTF8_LCASE) USING parquet")
spark.sql("INSERT INTO t VALUES 'abc', 'XYZ'")
spark.sql("CACHE TABLE t")
spark.sql("SELECT * FROM t WHERE s = 'ABC'").show()
// Before: 0 rows (wrong). After: 1 row.
```
Vanilla Spark's `StringColumnStats` is collation-aware, so this is
Gluten-specific.
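The mismatch in the repro can be reproduced without Spark. The sketch below is plain Scala under stated assumptions: `String.compareTo` stands in for Velox's unsigned byte-order compare (the two agree for ASCII), and lowercased equality stands in for `UTF8_LCASE`. The object and method names are hypothetical.

```scala
import java.util.Locale

object CollationMismatch {
  // Same two values as the repro table.
  val values: Seq[String] = Seq("abc", "XYZ")

  // Byte-order bounds: 'X' (0x58) sorts before 'a' (0x61).
  val byteMin: String = values.min
  val byteMax: String = values.max

  // Stats-based pruning: skip the batch if the literal falls outside [min, max].
  def prunedByByteOrder(literal: String): Boolean =
    literal.compareTo(byteMin) < 0 || literal.compareTo(byteMax) > 0

  // Stand-in for a case-insensitive collation: compare lowercased values.
  def matchesIgnoringCase(literal: String): Boolean =
    values.exists(_.toLowerCase(Locale.ROOT) == literal.toLowerCase(Locale.ROOT))
}
```

For the literal `"ABC"`, `prunedByByteOrder` is true because `'A'` (0x41) sorts below the byte-order minimum `"XYZ"`, while `matchesIgnoringCase` is also true, so the batch holding `'abc'` is skipped even though it contains a matching row.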
### Does this PR introduce _any_ user-facing change?
Yes — correctness fix. No new config.
### How was this patch tested?
- New `ColumnarCachedBatchDeserializeStatsSentinelSuite` (5 cases: EqualTo /
In / IsNotNull / StartsWith / LessThan) — PASS on spark-3.3 / 3.4 / 3.5 / 4.0 /
4.1.
- `BuildFilterPruneSuite` regression PASS on spark-3.5.
- Cross-profile build 5/5 SUCCESS.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7