uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561054469
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -172,19 +183,31 @@ public Collation(
   }

   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }

+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
That part becomes a little complicated for expressions that need to preserve the original string value. Consider StringReplace: `'BAbaBA'.replace('ab', 'XX')` with UTF8_BINARY_LCASE should return 'BXXXXA' (not 'bxxxxa'), since both 'Ab' and 'aB' match 'ab' under this collation. So the lowercased version is used for matching, and the original (non-lowercased) version is used for replacement.

@miland-db's 2 currently open PRs use this variant of StringSearch, although it's not clear whether that's definitely a good approach.
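To make the "match on the lowercased copy, replace in the original" idea concrete, here is a rough sketch using plain JDK strings only. It is not Spark's `UTF8String` code and does not use ICU's `StringSearch`; the class name `LcaseReplaceSketch` and the helper `replaceLcase` are made up for illustration. It assumes that lowercasing does not change the string length, so that match offsets in the lowercased copy line up with the original, and it simply returns the input unchanged when that assumption fails or the pattern is empty.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of Spark or ICU.
class LcaseReplaceSketch {

  // Finds matches in the lowercased copy of target, but copies the
  // unmatched segments from the original so their casing is preserved.
  static String replaceLcase(String target, String pattern, String replacement) {
    String lowerTarget = target.toLowerCase(Locale.ROOT);
    String lowerPattern = pattern.toLowerCase(Locale.ROOT);

    // Assumption: lowercasing keeps the length, so indexes are interchangeable.
    // If it does not (or the pattern is empty), this sketch gives up.
    if (lowerPattern.isEmpty() || lowerTarget.length() != target.length()) {
      return target;
    }

    StringBuilder out = new StringBuilder();
    int start = 0;
    int hit = lowerTarget.indexOf(lowerPattern, start);
    while (hit >= 0) {
      // Copy the untouched part from the ORIGINAL string, then the replacement.
      out.append(target, start, hit).append(replacement);
      start = hit + lowerPattern.length();
      hit = lowerTarget.indexOf(lowerPattern, start);
    }
    out.append(target, start, target.length());
    return out.toString();
  }

  public static void main(String[] args) {
    // "Hello" keeps its capital H even though matching is case-insensitive.
    System.out.println(replaceLcase("Hello WORLD", "world", "Spark"));
  }
}
```

The length check is the weak spot of this approach: some characters change length when lowercased, so a real implementation would need a proper offset mapping rather than the early return used here.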
That said, I do think we can figure out ways to do this so that we rely on `UTF8String` functions rather than `StringSearch` for UTF8_BINARY_LCASE, but we have yet to figure out whether that's really more efficient (or even easier to implement). For example, Milan noticed that the `UTF8String` implementation of "translate" actually does ".toString()" followed by a bunch of ".substring" calls, while StringSearch uses a `CharacterIterator`, which should probably be more efficient.

We may consider benchmarking this, but in terms of this PR, @cloud-fan, I think we could agree on either:
1. keeping this version of `getStringSearch` here until we're sure that we're not gonna need it for any expression (we could remove it once we've gone over all expressions)
2. removing this version of `getStringSearch` here until we're sure that we need it somewhere (we could add it back once there's a guaranteed need for it)