zhli1142015 commented on PR #12112:
URL: https://github.com/apache/gluten/pull/12112#issuecomment-4486448043
Comment 1 — Test coverage gap
The new sentinel buildFilter tests currently use AttributeReference("c",
StringType)(), which is the default binary-collation StringType. That does not
exercise the actual non-binary collation path this PR is trying to fix.
SimpleMetricsCachedBatchSerializer.buildFilter builds lower/upper-bound
attributes from the cached attribute data type, so for the real failure case
comparisons are collation-aware, not binary byte-order comparisons.
Could you add test coverage with an explicit non-binary collation
StringType (for example UTF8_BINARY_LCASE/UNICODE, depending on the supported
Spark profile) so EqualTo, In, StartsWith, and range predicates are evaluated
through the same collation-aware comparison path as production?
Comment 2 — Sentinel upper bound correctness under non-binary collation
The sentinel upper bound is currently 256 bytes of 0xff. That is a safe
“max” value for binary byte ordering, but for non-binary collations Spark
comparisons are collation-aware. Those bytes are invalid UTF-8 and will not
necessarily compare as a universal maximum value under every Spark collation.
If a literal can compare greater than this sentinel under a non-binary
collation, the parent buildFilter may still prune the batch incorrectly.
A safer design would be to make the Velox serializer’s buildFilter wrapper
ignore/minimize min/max pruning for non-binary-collation string attributes,
rather than fabricating synthetic min/max bounds. That would preserve
correctness by passing through batches whenever the predicate depends on
unsupported string ordering stats.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]