Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

via GitHub Thu, 11 Apr 2024 06:49:29 -0700


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561054469



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   That part gets a little complicated for expressions that need to preserve
   the original string value. Consider StringReplace:
   `'BAbaBA'.replace('ab', 'XX')` with UTF8_BINARY_LCASE should return 'BXXaBA'
   (not 'bxxaba'), so the `.toLowerCase` version is used for matching, and the
   original (non-lowercased) version is used for replacement.
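
To illustrate the "match on the lowercased copy, replace in the original" idea, here is a minimal sketch using plain `java.lang.String` only. The class `LcaseReplaceSketch` and the method `replaceIgnoreCase` are hypothetical names chosen for this example; they are not part of Spark, `UTF8String`, or `CollationFactory`. The sketch also assumes that lowercasing does not shift character offsets, which holds for ASCII but not for every Unicode string.

```java
import java.util.Locale;

class LcaseReplaceSketch {

  /**
   * Hypothetical helper (not Spark code): finds matches of the pattern in a
   * lowercased copy of the target, but copies unmatched text from the original
   * target so that its casing is preserved.
   */
  public static String replaceIgnoreCase(String target, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      return target;
    }
    String lowerTarget = target.toLowerCase(Locale.ROOT);
    String lowerPattern = pattern.toLowerCase(Locale.ROOT);
    // Assumption: offsets in lowerTarget line up with offsets in target.
    // This is only a coarse guard; a real implementation needs more care.
    if (lowerTarget.length() != target.length()) {
      throw new IllegalArgumentException("lowercasing changed the string length");
    }
    StringBuilder out = new StringBuilder();
    int from = 0;
    int hit;
    while ((hit = lowerTarget.indexOf(lowerPattern, from)) >= 0) {
      out.append(target, from, hit);   // unmatched text, original casing
      out.append(replacement);         // replacement is inserted verbatim
      from = hit + lowerPattern.length();
    }
    out.append(target, from, target.length());
    return out.toString();
  }

  public static void main(String[] args) {
    // prints "Spark Core"
    System.out.println(replaceIgnoreCase("Spark SQL", "sql", "Core"));
  }
}
```

The point of the sketch is only that two views of the same string are needed: one for finding offsets and one for building the result. Whether this is done through `StringSearch` or through `UTF8String` functions is the open question in this thread.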
   
   @miland-db currently has 2 open PRs that use this variant of StringSearch,
   although it's not yet clear whether that's a good approach. That said, I do
   think we can figure out ways to rely on `UTF8String` functions rather than
   `StringSearch` for UTF8_BINARY_LCASE, but we have yet to determine whether
   that's really more efficient (or even easier to implement). For example,
   Milan noticed that the `UTF8String` implementation of "translate" actually
   does `.toString()` followed by a bunch of `.substring` calls, while
   StringSearch uses a `CharacterIterator`, which should probably be more
   efficient.
   
   We may consider benchmarking this, but in terms of this PR, @cloud-fan, I
   think we could agree on either:
   1. keeping this version of `getStringSearch` here until we're sure that we
   won't need it for any expression (we could remove it once we've gone over
   all expressions)
   2. removing this version of `getStringSearch` for now, until we're sure that
   we need it somewhere (we could add it back once there's a guaranteed need
   for it)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


