zhli1142015 commented on PR #12112:
URL: https://github.com/apache/gluten/pull/12112#issuecomment-4486448043

   
     Comment 1 — Test coverage gap
   
     The new sentinel buildFilter tests currently use AttributeReference("c", 
StringType)(), which is the default binary-collation StringType. That does not 
exercise  the actual non-binary collation path this PR is trying to fix. 
SimpleMetricsCachedBatchSerializer.buildFilter builds lower/upper-bound 
attributes from the cached  attribute data type, so for the real failure case 
comparisons are collation-aware, not binary byte-order comparisons.
   
     Could you add test coverage with an explicit non-binary collation 
StringType (for example UTF8_BINARY_LCASE/UNICODE, depending on the supported 
Spark profile) so  EqualTo, In, StartsWith, and range predicates are evaluated 
through the same collation-aware comparison path as production?
   
     Comment 2 — Sentinel upper bound correctness under non-binary collation
   
     The sentinel upper bound is currently 256 bytes of 0xff. That is a safe 
“max” value for binary byte ordering, but for non-binary collations Spark 
comparisons are  collation-aware. Those bytes are invalid UTF-8 and will not 
necessarily compare as a universal maximum value under every Spark collation. 
If a literal can compare  greater than this sentinel under a non-binary 
collation, the parent buildFilter may still prune the batch incorrectly.
   
     A safer design would be to make the Velox serializer’s buildFilter wrapper 
ignore/minimize min/max pruning for non-binary-collation string attributes, 
rather than  fabricating synthetic min/max bounds. That would preserve 
correctness by passing through batches whenever the predicate depends on 
unsupported string ordering  stats.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to