yaooqinn commented on PR #12112: URL: https://github.com/apache/gluten/pull/12112#issuecomment-4489683082
Thanks @zhli1142015 for the thorough review — both points are valid and addressed in rev 3.2 (`48c8bc749d`). **C1 (test coverage)**: The earlier sentinel-suite only exercised the deserialize hot path. Added 8 wrapper-behavior tests (`ColumnarCachedBatchBuildFilterPruneSuite` W1–W8) plus end-to-end UTF8_LCASE / UNICODE_CI cases in `ColumnarCachedBatchE2ESuite`. W8 is an explicit anti-regression that bypasses the wrapper to prove a UTF8_LCASE bound check would have wrongly pruned the batch. **C2 (sentinel correctness)**: You are right — `0xFF * 256B` is not a universal upper bound under non-binary collations. `PhysicalStringType.ordering = CollationFactory.fetchCollation(id).comparator` (`PhysicalDataType.scala:334`) means `<=` on collation-aware StringType is governed by the collation’s comparator, not by raw byte order, so any fixed sentinel can falsify the bound check. Rev 3.2 abandons the sentinel approach entirely. Reader-side mechanism is now a `splitConjunctivePredicates`-based wrapper around `SimpleMetricsCachedBatchSerializer.buildFilter`: any AND-conjunct referencing a non-binary collation StringType attribute is dropped before partition-stats evaluation; binary attributes still prune. Conjuncts that cannot be split (Or, etc.) referencing such attributes conservatively keep the batch. Writer/wire-format unchanged. Verified locally: - `mvn clean install` + suites on `-Pspark-3.5/scala-2.12`, `-Pspark-4.0/scala-2.13`, `-Pspark-4.1/scala-2.13` - spark-4.0 / 4.1: 42/42 PASS - spark-3.5: 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases guarded by `assume(isCollationAware)`, since CollationFactory does not exist on 3.5) PR description updated to reflect rev 3.2. PTAL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
