This is an automated email from the ASF dual-hosted git repository.
maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new fb8d01acf166 [SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour
with UTF8_BINARY collation
fb8d01acf166 is described below
commit fb8d01acf1669252bb319686b3879d779a22ab98
Author: viktorluc-db <[email protected]>
AuthorDate: Sun Sep 1 17:15:36 2024 +0200
[SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8_BINARY
collation
### What changes were proposed in this pull request?
This PR changes how Spark performs initCap under the UTF8_BINARY
collation.
With this change, initCap titlecases the first character of every word and
lowercases every other character. Words are separated only by ASCII space.
Special care is taken when lowercasing Σ: it becomes ς when it is at the end
of a word (ignoring case-ignorable characters) and σ otherwise. The previous
implementation already handled this correctly because it lowercased the whole
string at once; this PR handles it manually because the whole-string lowercase
function is no longer used.
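
The final-sigma rule described above can be observed with the JDK alone. The following is a minimal sketch, not the code from this PR: it relies on `String.toLowerCase(Locale.ROOT)`, which applies a comparable context-sensitive rule for Σ, and the class and method names are hypothetical.

```java
import java.util.Locale;

class FinalSigmaDemo {
    // Lowercases with the JDK's locale-independent rules, which treat
    // capital sigma (U+03A3) differently depending on its position in a word.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Sigma at the end of a word, preceded by a cased letter: final form.
        System.out.println(lower("\u039F\u03A3").equals("\u03BF\u03C2"));
        // Sigma followed by a cased letter: non-final form.
        System.out.println(lower("\u03A3\u039F").equals("\u03C3\u03BF"));
    }
}
```

The PR itself decides between the two forms with ICU's `CASED` and `CASE_IGNORABLE` properties rather than with the JDK method used in this sketch.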
The key difference between outputs that this PR introduces is:
| input | current_initCap(input) | new_initCap(input) |
|----------|----------|----------|
| İo | İo (I\u0307o) | İo |
| ß fi ffi ff st | ß fi ffi ff st | Ss Fi Ffi Ff St |
These are just some examples; many more mappings are affected.
More details about the key changes are in the next section.
This behaviour is gated by the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf,
which is true by default.
### Why are the changes needed?
The previous implementation first lowercases the complete string, and then
titlecases the first character of every word[1].
When titlecasing the first character of every word, it maps a single
codepoint to a single codepoint[2].
This leads to the following behaviour with respect to [1]:
| input | initCap(input) |
|----------|----------|
| İo | İo (I\u0307o) |
In summary, when the first character of a word (for example "İ") lowercases
to more than one character (for example "i\u0307"), only the first of those
characters ("i") is titlecased, because titlecasing is applied to the first
character of the word after the whole string has already been lowercased. The
result is "I\u0307" instead of "İ".
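
The effect of [1] can be reproduced outside Spark. The sketch below is an approximation that uses only JDK case mappings instead of the Spark and ICU code paths; the class and method names are hypothetical, and words are split on ASCII space only.

```java
import java.util.Locale;

class LowercaseThenTitlecaseDemo {
    // Mimics the previous approach: lowercase the whole string first, then
    // title-case only the first code point of each space-separated word.
    static String oldStyleInitCap(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder();
        boolean isNewWord = true;
        for (int i = 0; i < lower.length(); ) {
            int cp = lower.codePointAt(i);
            sb.appendCodePoint(isNewWord ? Character.toTitleCase(cp) : cp);
            isNewWord = (cp == ' ');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Capital I with dot above (U+0130) followed by "o".
        String result = oldStyleInitCap("\u0130o");
        // The dotted capital is lost: the result is "I" + U+0307 + "o".
        System.out.println(result.equals("I\u0307o"));
    }
}
```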
The behaviour that [2] produces is:
| input | initCap(input) |
|----------|----------|
| ß fi ffi ff st | ß fi ffi ff st |
While the expected output would probably be:
| input | initCap(input) |
|----------|----------|
| ß fi ffi ff st | Ss Fi Ffi Ff St |
Here each of those characters titlecases into more than one character, which
the single-codepoint mapping in [2] cannot produce.
Again, these are just examples and not an exhaustive list of all the
mappings that have been changed.
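
One way to handle the one-to-many cases from [2] is to consult a small lookup table before falling back to the single-codepoint mapping. The sketch below illustrates that idea with the JDK only. The four table entries are hand-picked from Unicode's special casing data, and the class, field, and method names are hypothetical. The PR builds its table from ICU rather than by hand.

```java
import java.util.Map;

class OneToManyTitleCaseDemo {
    // Hand-picked code points whose title-case form is longer than one
    // code point (assumed subset; a real table would be precomputed).
    static final Map<Integer, String> ONE_TO_MANY = Map.of(
        0x00DF, "Ss",   // sharp s
        0xFB00, "Ff",   // ff ligature
        0xFB01, "Fi",   // fi ligature
        0xFB03, "Ffi"); // ffi ligature

    // Title-cases one code point: table first, single-codepoint mapping otherwise.
    static String titleCaseCodePoint(int cp) {
        String mapped = ONE_TO_MANY.get(cp);
        if (mapped != null) {
            return mapped;
        }
        return new String(Character.toChars(Character.toTitleCase(cp)));
    }

    public static void main(String[] args) {
        // The single-codepoint mapping leaves sharp s unchanged.
        System.out.println(Character.toTitleCase(0x00DF) == 0x00DF);
        // The table-backed version expands it to two characters.
        System.out.println(titleCaseCodePoint(0x00DF).equals("Ss"));
    }
}
```

Keeping the fast single-codepoint call for the common case and reserving the table for the few exceptions avoids allocating a string per character.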
### Does this PR introduce _any_ user-facing change?
Yes, the InitCap expression now returns different results for:
- One-to-many case mapping (e.g. Turkish dotted I, ß, fi)
### How was this patch tested?
Tests in CollationSupportSuite.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47771 from viktorluc-db/initCap.
Authored-by: viktorluc-db <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
---
.../catalyst/util/CollationAwareUTF8String.java | 215 +++++++++++++++++----
.../spark/sql/catalyst/util/CollationSupport.java | 2 +-
.../catalyst/util/SpecialCodePointConstants.java | 33 ++++
.../org/apache/spark/unsafe/UTF8StringBuilder.java | 29 +++
.../spark/unsafe/types/CollationSupportSuite.java | 81 ++++++--
.../apache/spark/unsafe/types/UTF8StringSuite.java | 25 +++
6 files changed, 337 insertions(+), 48 deletions(-)
diff --git
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index e3821a0b8598..5ed3048fb72b 100644
---
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -17,6 +17,7 @@
package org.apache.spark.sql.catalyst.util;
import com.ibm.icu.lang.UCharacter;
+import com.ibm.icu.lang.UProperty;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
@@ -48,6 +49,16 @@ public class CollationAwareUTF8String {
*/
private static final int MATCH_NOT_FOUND = -1;
+ /**
+ * `COMBINED_ASCII_SMALL_I_COMBINING_DOT` is an internal representation of
the combined
+ * lowercase code point for ASCII lowercase letter i with an additional
combining dot character
+ * (U+0307). This integer value is not a valid code point itself, but rather
an artificial code
+ * point marker used to represent the two lowercase characters that are the
result of converting
+ * the uppercase Turkish dotted letter I with a combining dot character
(U+0130) to lowercase.
+ */
+ private static final int COMBINED_ASCII_SMALL_I_COMBINING_DOT =
+ SpecialCodePointConstants.ASCII_SMALL_I << 16 |
SpecialCodePointConstants.COMBINING_DOT;
+
/**
* Returns whether the target string starts with the specified prefix,
starting from the
* specified position (0-based index referring to character position in
UTF8String), with respect
@@ -105,9 +116,9 @@ public class CollationAwareUTF8String {
} else {
// Use buffered lowercase code point iteration to handle one-to-many
case mappings.
targetCodePoint = getLowercaseCodePoint(targetIterator.next());
- if (targetCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
- targetCodePoint = CODE_POINT_LOWERCASE_I;
- codePointBuffer = CODE_POINT_COMBINING_DOT;
+ if (targetCodePoint == COMBINED_ASCII_SMALL_I_COMBINING_DOT) {
+ targetCodePoint = SpecialCodePointConstants.ASCII_SMALL_I;
+ codePointBuffer = SpecialCodePointConstants.COMBINING_DOT;
}
++matchLength;
}
@@ -207,9 +218,9 @@ public class CollationAwareUTF8String {
} else {
// Use buffered lowercase code point iteration to handle one-to-many
case mappings.
targetCodePoint = getLowercaseCodePoint(targetIterator.next());
- if (targetCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
- targetCodePoint = CODE_POINT_COMBINING_DOT;
- codePointBuffer = CODE_POINT_LOWERCASE_I;
+ if (targetCodePoint == COMBINED_ASCII_SMALL_I_COMBINING_DOT) {
+ targetCodePoint = SpecialCodePointConstants.COMBINING_DOT;
+ codePointBuffer = SpecialCodePointConstants.ASCII_SMALL_I;
}
++matchLength;
}
@@ -461,28 +472,16 @@ public class CollationAwareUTF8String {
*/
private static void appendLowercaseCodePoint(final int codePoint, final
StringBuilder sb) {
int lowercaseCodePoint = getLowercaseCodePoint(codePoint);
- if (lowercaseCodePoint == CODE_POINT_COMBINED_LOWERCASE_I_DOT) {
+ if (lowercaseCodePoint == COMBINED_ASCII_SMALL_I_COMBINING_DOT) {
// Latin capital letter I with dot above is mapped to 2 lowercase
characters.
- sb.appendCodePoint(0x0069);
- sb.appendCodePoint(0x0307);
+ sb.appendCodePoint(SpecialCodePointConstants.ASCII_SMALL_I);
+ sb.appendCodePoint(SpecialCodePointConstants.COMBINING_DOT);
} else {
// All other characters should follow context-unaware ICU single-code
point case mapping.
sb.appendCodePoint(lowercaseCodePoint);
}
}
- /**
- * `CODE_POINT_COMBINED_LOWERCASE_I_DOT` is an internal representation of
the combined lowercase
- * code point for ASCII lowercase letter i with an additional combining dot
character (U+0307).
- * This integer value is not a valid code point itself, but rather an
artificial code point
- * marker used to represent the two lowercase characters that are the result
of converting the
- * uppercase Turkish dotted letter I with a combining dot character (U+0130)
to lowercase.
- */
- private static final int CODE_POINT_LOWERCASE_I = 0x69;
- private static final int CODE_POINT_COMBINING_DOT = 0x307;
- private static final int CODE_POINT_COMBINED_LOWERCASE_I_DOT =
- CODE_POINT_LOWERCASE_I << 16 | CODE_POINT_COMBINING_DOT;
-
/**
* Returns the lowercase version of the provided code point, with special
handling for
* one-to-many case mappings (i.e. characters that map to multiple
characters in lowercase) and
@@ -490,15 +489,15 @@ public class CollationAwareUTF8String {
* the position in the string relative to other characters in lowercase).
*/
private static int getLowercaseCodePoint(final int codePoint) {
- if (codePoint == 0x0130) {
+ if (codePoint == SpecialCodePointConstants.CAPITAL_I_WITH_DOT_ABOVE) {
// Latin capital letter I with dot above is mapped to 2 lowercase
characters.
- return CODE_POINT_COMBINED_LOWERCASE_I_DOT;
+ return COMBINED_ASCII_SMALL_I_COMBINING_DOT;
}
- else if (codePoint == 0x03C2) {
+ else if (codePoint == SpecialCodePointConstants.GREEK_FINAL_SIGMA) {
// Greek final and non-final letter sigma should be mapped the same.
This is achieved by
// mapping Greek small final sigma (U+03C2) to Greek small non-final
sigma (U+03C3). Capital
// letter sigma (U+03A3) is mapped to small non-final sigma (U+03C3) in
the `else` branch.
- return 0x03C3;
+ return SpecialCodePointConstants.GREEK_SMALL_SIGMA;
}
else {
// All other characters should follow context-unaware ICU single-code
point case mapping.
@@ -550,6 +549,152 @@ public class CollationAwareUTF8String {
BreakIterator.getWordInstance(locale)));
}
+ /**
+ * This 'HashMap' is introduced as a performance speedup. Since title-casing
a codepoint can
+ * result in more than a single codepoint, for correctness, we would use
+ * 'UCharacter.toTitleCase(String)' which returns a 'String'. If we use
+ * 'UCharacter.toTitleCase(int)' (the version of the same function which
converts a single
+ * codepoint to its title-case codepoint), it would be faster than the
previously mentioned
+ * version, but the problem here is that we don't handle when title-casing a
codepoint yields more
+ * than 1 codepoint. Since there are only 48 codepoints that are mapped to
more than 1 codepoint
+ * when title-cased, they are precalculated here, so that the faster
function for title-casing
+ * could be used in combination with this 'HashMap' in the method
'appendCodepointToTitleCase'.
+ */
+ private static final HashMap<Integer, String>
codepointOneToManyTitleCaseLookupTable =
+ new HashMap<>(){{
+ StringBuilder sb = new StringBuilder();
+ for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i)
{
+ sb.appendCodePoint(i);
+ String titleCase = UCharacter.toTitleCase(sb.toString(), null);
+ if (titleCase.codePointCount(0, titleCase.length()) > 1) {
+ put(i, titleCase);
+ }
+ sb.setLength(0);
+ }
+ }};
+
+ /**
+ * Title-casing a string using ICU case mappings. Iterates over the string
and title-cases
+ * the first character in each word, and lowercases every other character.
Handles lowercasing
+ * capital Greek letter sigma ('Σ') separately, taking into account if it
should be a small final
+ * Greek sigma ('ς') or small non-final Greek sigma ('σ'). Words are
separated by ASCII
+ * space(\u0020).
+ *
+ * @param source UTF8String to be title cased
+ * @return title cased source
+ */
+ public static UTF8String toTitleCaseICU(UTF8String source) {
+ // In the default UTF8String implementation, `toLowerCase` method
implicitly does UTF8String
+ // validation (replacing invalid UTF-8 byte sequences with Unicode
replacement character
+ // U+FFFD), but now we have to do the validation manually.
+ source = source.makeValid();
+
+ // Building the title cased source with 'sb'.
+ UTF8StringBuilder sb = new UTF8StringBuilder();
+
+ // 'isNewWord' is true if the current character is the beginning of a
word, false otherwise.
+ boolean isNewWord = true;
+ // We are maintaining if the current character is preceded by a cased
letter.
+ // This is used when lowercasing capital Greek letter sigma ('Σ'), to
figure out if it should be
+ // lowercased into σ or ς.
+ boolean precededByCasedLetter = false;
+
+ // 'offset' is a byte offset in source's byte array pointing to the
beginning of the character
+ // that we need to process next.
+ int offset = 0;
+ int len = source.numBytes();
+
+ while (offset < len) {
+ // We will actually call 'codePointFrom()' 2 times for each character in
the worst case (once
+ // here, and once in 'followedByCasedLetter'). Example of a string where
we call it 2 times
+ // for almost every character is 'ΣΣΣΣΣ' (a string consisting only of
Greek capital sigma)
+ // and 'Σ`````' (a string consisting of a Greek capital sigma, followed
by case-ignorable
+ // characters).
+ int codepoint = source.codePointFrom(offset);
+ // Appending the correctly cased character onto 'sb'.
+ appendTitleCasedCodepoint(sb, codepoint, isNewWord,
precededByCasedLetter, source, offset);
+ // Updating 'isNewWord', 'precededByCasedLetter' and 'offset' to be
ready for the next
+ // character that we will process.
+ isNewWord = (codepoint == SpecialCodePointConstants.ASCII_SPACE);
+ if (!UCharacter.hasBinaryProperty(codepoint, UProperty.CASE_IGNORABLE)) {
+ precededByCasedLetter = UCharacter.hasBinaryProperty(codepoint,
UProperty.CASED);
+ }
+ offset += UTF8String.numBytesForFirstByte(source.getByte(offset));
+ }
+ return sb.build();
+ }
+
+ private static void appendTitleCasedCodepoint(
+ UTF8StringBuilder sb,
+ int codepoint,
+ boolean isAfterAsciiSpace,
+ boolean precededByCasedLetter,
+ UTF8String source,
+ int offset) {
+ if (isAfterAsciiSpace) {
+ // Title-casing a character if it is in the beginning of a new word.
+ appendCodepointToTitleCase(sb, codepoint);
+ return;
+ }
+ if (codepoint == SpecialCodePointConstants.GREEK_CAPITAL_SIGMA) {
+ // Handling capital Greek letter sigma ('Σ').
+ appendLowerCasedGreekCapitalSigma(sb, precededByCasedLetter, source,
offset);
+ return;
+ }
+ // If it's not the beginning of a word, or a capital Greek letter sigma
('Σ'), we lowercase the
+ // character. We specially handle 'CAPITAL_I_WITH_DOT_ABOVE'.
+ if (codepoint == SpecialCodePointConstants.CAPITAL_I_WITH_DOT_ABOVE) {
+ sb.appendCodePoint(SpecialCodePointConstants.ASCII_SMALL_I);
+ sb.appendCodePoint(SpecialCodePointConstants.COMBINING_DOT);
+ return;
+ }
+ sb.appendCodePoint(UCharacter.toLowerCase(codepoint));
+ }
+
+ private static void appendLowerCasedGreekCapitalSigma(
+ UTF8StringBuilder sb,
+ boolean precededByCasedLetter,
+ UTF8String source,
+ int offset) {
+ int codepoint = (!followedByCasedLetter(source, offset) &&
precededByCasedLetter)
+ ? SpecialCodePointConstants.GREEK_FINAL_SIGMA
+ : SpecialCodePointConstants.GREEK_SMALL_SIGMA;
+ sb.appendCodePoint(codepoint);
+ }
+
+ /**
+ * Checks if the character beginning at 'offset'(in 'sources' byte array) is
followed by a cased
+ * letter.
+ */
+ private static boolean followedByCasedLetter(UTF8String source, int offset) {
+ // Moving the offset one character forward, so we could start the linear
search from there.
+ offset += UTF8String.numBytesForFirstByte(source.getByte(offset));
+ int len = source.numBytes();
+
+ while (offset < len) {
+ int codepoint = source.codePointFrom(offset);
+
+ if (UCharacter.hasBinaryProperty(codepoint, UProperty.CASE_IGNORABLE)) {
+ offset += UTF8String.numBytesForFirstByte(source.getByte(offset));
+ continue;
+ }
+ return UCharacter.hasBinaryProperty(codepoint, UProperty.CASED);
+ }
+ return false;
+ }
+
+ /**
+ * Appends title-case of a single character to a 'StringBuilder' using the
ICU root locale rules.
+ */
+ private static void appendCodepointToTitleCase(UTF8StringBuilder sb, int
codepoint) {
+ String toTitleCase = codepointOneToManyTitleCaseLookupTable.get(codepoint);
+ if (toTitleCase == null) {
+ sb.appendCodePoint(UCharacter.toTitleCase(codepoint));
+ } else {
+ sb.append(toTitleCase);
+ }
+ }
+
/*
* Returns the position of the first occurrence of the match string in the
set string,
* counting ASCII commas as delimiters. The match string is compared in a
collation-aware manner,
@@ -843,11 +988,11 @@ public class CollationAwareUTF8String {
}
// Special handling for letter i (U+0069) followed by a combining dot
(U+0307). By ensuring
// that `CODE_POINT_LOWERCASE_I` is buffered, we guarantee finding a
max-length match.
- if (lowercaseDict.containsKey(CODE_POINT_COMBINED_LOWERCASE_I_DOT) &&
- codePoint == CODE_POINT_LOWERCASE_I && inputIter.hasNext()) {
+ if (lowercaseDict.containsKey(COMBINED_ASCII_SMALL_I_COMBINING_DOT)
+ && codePoint == SpecialCodePointConstants.ASCII_SMALL_I &&
inputIter.hasNext()) {
int nextCodePoint = inputIter.next();
- if (nextCodePoint == CODE_POINT_COMBINING_DOT) {
- codePoint = CODE_POINT_COMBINED_LOWERCASE_I_DOT;
+ if (nextCodePoint == SpecialCodePointConstants.COMBINING_DOT) {
+ codePoint = COMBINED_ASCII_SMALL_I_COMBINING_DOT;
} else {
codePointBuffer = nextCodePoint;
}
@@ -1007,11 +1152,11 @@ public class CollationAwareUTF8String {
codePoint = getLowercaseCodePoint(srcIter.next());
}
// Special handling for Turkish dotted uppercase letter I.
- if (codePoint == CODE_POINT_LOWERCASE_I && srcIter.hasNext() &&
- trimChars.contains(CODE_POINT_COMBINED_LOWERCASE_I_DOT)) {
+ if (codePoint == SpecialCodePointConstants.ASCII_SMALL_I &&
srcIter.hasNext() &&
+ trimChars.contains(COMBINED_ASCII_SMALL_I_COMBINING_DOT)) {
codePointBuffer = codePoint;
codePoint = getLowercaseCodePoint(srcIter.next());
- if (codePoint == CODE_POINT_COMBINING_DOT) {
+ if (codePoint == SpecialCodePointConstants.COMBINING_DOT) {
searchIndex += 2;
codePointBuffer = -1;
} else if (trimChars.contains(codePointBuffer)) {
@@ -1125,11 +1270,11 @@ public class CollationAwareUTF8String {
codePoint = getLowercaseCodePoint(srcIter.next());
}
// Special handling for Turkish dotted uppercase letter I.
- if (codePoint == CODE_POINT_COMBINING_DOT && srcIter.hasNext() &&
- trimChars.contains(CODE_POINT_COMBINED_LOWERCASE_I_DOT)) {
+ if (codePoint == SpecialCodePointConstants.COMBINING_DOT &&
srcIter.hasNext() &&
+ trimChars.contains(COMBINED_ASCII_SMALL_I_COMBINING_DOT)) {
codePointBuffer = codePoint;
codePoint = getLowercaseCodePoint(srcIter.next());
- if (codePoint == CODE_POINT_LOWERCASE_I) {
+ if (codePoint == SpecialCodePointConstants.ASCII_SMALL_I) {
searchIndex -= 2;
codePointBuffer = -1;
} else if (trimChars.contains(codePointBuffer)) {
diff --git
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
index 651683796877..f05d9e512568 100644
---
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
+++
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
@@ -283,7 +283,7 @@ public final class CollationSupport {
return v.toLowerCase().toTitleCase();
}
public static UTF8String execBinaryICU(final UTF8String v) {
- return CollationAwareUTF8String.toLowerCase(v).toTitleCaseICU();
+ return CollationAwareUTF8String.toTitleCaseICU(v);
}
public static UTF8String execLowercase(final UTF8String v) {
return CollationAwareUTF8String.toTitleCase(v);
diff --git
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/SpecialCodePointConstants.java
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/SpecialCodePointConstants.java
new file mode 100644
index 000000000000..db615d747910
--- /dev/null
+++
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/SpecialCodePointConstants.java
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+/**
+ * 'SpecialCodePointConstants' is introduced in order to keep the codepoints
used in
+ * 'CollationAwareUTF8String' in one place.
+ */
+public class SpecialCodePointConstants {
+
+ public static final int COMBINING_DOT = 0x0307;
+ public static final int ASCII_SMALL_I = 0x0069;
+ public static final int ASCII_SPACE = 0x0020;
+ public static final int GREEK_CAPITAL_SIGMA = 0x03A3;
+ public static final int GREEK_SMALL_SIGMA = 0x03C3;
+ public static final int GREEK_FINAL_SIGMA = 0x03C2;
+ public static final int CAPITAL_I_WITH_DOT_ABOVE = 0x0130;
+}
diff --git
a/common/unsafe/src/main/java/org/apache/spark/unsafe/UTF8StringBuilder.java
b/common/unsafe/src/main/java/org/apache/spark/unsafe/UTF8StringBuilder.java
index 481ea89090b2..3a697345ce1f 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/UTF8StringBuilder.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/UTF8StringBuilder.java
@@ -96,4 +96,33 @@ public class UTF8StringBuilder {
public UTF8String build() {
return UTF8String.fromBytes(buffer, 0, totalSize());
}
+
+ public void appendCodePoint(int codePoint) {
+ if (codePoint <= 0x7F) {
+ grow(1);
+ buffer[cursor - Platform.BYTE_ARRAY_OFFSET] = (byte) codePoint;
+ ++cursor;
+ } else if (codePoint <= 0x7FF) {
+ grow(2);
+ buffer[cursor - Platform.BYTE_ARRAY_OFFSET] = (byte) (0xC0 | (codePoint
>> 6));
+ buffer[cursor + 1 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
(codePoint & 0x3F));
+ cursor += 2;
+ } else if (codePoint <= 0xFFFF) {
+ grow(3);
+ buffer[cursor - Platform.BYTE_ARRAY_OFFSET] = (byte) (0xE0 | (codePoint
>> 12));
+ buffer[cursor + 1 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
((codePoint >> 6) & 0x3F));
+ buffer[cursor + 2 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
(codePoint & 0x3F));
+ cursor += 3;
+ } else if (codePoint <= 0x10FFFF) {
+ grow(4);
+ buffer[cursor - Platform.BYTE_ARRAY_OFFSET] = (byte) (0xF0 | (codePoint
>> 18));
+ buffer[cursor + 1 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
((codePoint >> 12) & 0x3F));
+ buffer[cursor + 2 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
((codePoint >> 6) & 0x3F));
+ buffer[cursor + 3 - Platform.BYTE_ARRAY_OFFSET] = (byte) (0x80 |
(codePoint & 0x3F));
+ cursor += 4;
+ } else {
+ throw new IllegalArgumentException("Invalid Unicode codePoint: " +
codePoint);
+ }
+ }
+
}
diff --git
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
index 5cc975d38d4d..5719303a0dce 100644
---
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
+++
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
@@ -1334,6 +1334,23 @@ public class CollationSupportSuite {
// Note: results should be the same in these tests for both ICU and
JVM-based implementations.
}
+ private void assertInitCap(
+ String target,
+ String collationName,
+ String expectedICU,
+ String expectedNonICU) throws SparkException {
+ UTF8String target_utf8 = UTF8String.fromString(target);
+ UTF8String expectedICU_utf8 = UTF8String.fromString(expectedICU);
+ UTF8String expectedNonICU_utf8 = UTF8String.fromString(expectedNonICU);
+ int collationId = CollationFactory.collationNameToId(collationName);
+ // Testing the new ICU-based implementation of the Lower function.
+ assertEquals(expectedICU_utf8, CollationSupport.InitCap.exec(target_utf8,
collationId, true));
+ // Testing the old JVM-based implementation of the Lower function.
+ assertEquals(expectedNonICU_utf8,
CollationSupport.InitCap.exec(target_utf8, collationId,
+ false));
+ // Note: results should be the same in these tests for both ICU and
JVM-based implementations.
+ }
+
@Test
public void testInitCap() throws SparkException {
for (String collationName: testSupportedCollations) {
@@ -1372,12 +1389,22 @@ public class CollationSupportSuite {
assertInitCap("ÄBĆΔE", "UTF8_LCASE", "Äbćδe");
assertInitCap("ÄBĆΔE", "UNICODE", "Äbćδe");
assertInitCap("ÄBĆΔE", "UNICODE_CI", "Äbćδe");
+ // Case-variable character length
+ assertInitCap("İo", "UTF8_BINARY", "İo", "I\u0307o");
+ assertInitCap("İo", "UTF8_LCASE", "İo");
+ assertInitCap("İo", "UNICODE", "İo");
+ assertInitCap("İo", "UNICODE_CI", "İo");
+ assertInitCap("i\u0307o", "UTF8_BINARY", "I\u0307o");
+ assertInitCap("i\u0307o", "UTF8_LCASE", "I\u0307o");
+ assertInitCap("i\u0307o", "UNICODE", "I\u0307o");
+ assertInitCap("i\u0307o", "UNICODE_CI", "I\u0307o");
+ // Different possible word boundaries
assertInitCap("aB 世 de", "UTF8_BINARY", "Ab 世 De");
assertInitCap("aB 世 de", "UTF8_LCASE", "Ab 世 De");
assertInitCap("aB 世 de", "UNICODE", "Ab 世 De");
assertInitCap("aB 世 de", "UNICODE_CI", "Ab 世 De");
// One-to-many case mapping (e.g. Turkish dotted I).
- assertInitCap("İ", "UTF8_BINARY", "I\u0307");
+ assertInitCap("İ", "UTF8_BINARY", "İ", "I\u0307");
assertInitCap("İ", "UTF8_LCASE", "İ");
assertInitCap("İ", "UNICODE", "İ");
assertInitCap("İ", "UNICODE_CI", "İ");
@@ -1385,7 +1412,7 @@ public class CollationSupportSuite {
assertInitCap("I\u0307", "UTF8_LCASE","I\u0307");
assertInitCap("I\u0307", "UNICODE","I\u0307");
assertInitCap("I\u0307", "UNICODE_CI","I\u0307");
- assertInitCap("İonic", "UTF8_BINARY", "I\u0307onic");
+ assertInitCap("İonic", "UTF8_BINARY", "İonic", "I\u0307onic");
assertInitCap("İonic", "UTF8_LCASE", "İonic");
assertInitCap("İonic", "UNICODE", "İonic");
assertInitCap("İonic", "UNICODE_CI", "İonic");
@@ -1414,23 +1441,24 @@ public class CollationSupportSuite {
assertInitCap("𝔸", "UTF8_LCASE", "𝔸");
assertInitCap("𝔸", "UNICODE", "𝔸");
assertInitCap("𝔸", "UNICODE_CI", "𝔸");
- assertInitCap("𐐅", "UTF8_BINARY", "𐐭");
+ assertInitCap("𐐅", "UTF8_BINARY", "\uD801\uDC05", "𐐭");
assertInitCap("𐐅", "UTF8_LCASE", "𐐅");
assertInitCap("𐐅", "UNICODE", "𐐅");
assertInitCap("𐐅", "UNICODE_CI", "𐐅");
- assertInitCap("𐐭", "UTF8_BINARY", "𐐭");
+ assertInitCap("𐐭", "UTF8_BINARY", "\uD801\uDC05", "𐐭");
assertInitCap("𐐭", "UTF8_LCASE", "𐐅");
assertInitCap("𐐭", "UNICODE", "𐐅");
assertInitCap("𐐭", "UNICODE_CI", "𐐅");
- assertInitCap("𐐭𝔸", "UTF8_BINARY", "𐐭𝔸");
+ assertInitCap("𐐭𝔸", "UTF8_BINARY", "\uD801\uDC05\uD835\uDD38", "𐐭𝔸");
assertInitCap("𐐭𝔸", "UTF8_LCASE", "𐐅𝔸");
assertInitCap("𐐭𝔸", "UNICODE", "𐐅𝔸");
assertInitCap("𐐭𝔸", "UNICODE_CI", "𐐅𝔸");
// Ligatures.
- assertInitCap("ß fi ffi ff st ῗ", "UTF8_BINARY","ß fi ffi ff st ῗ");
- assertInitCap("ß fi ffi ff st ῗ", "UTF8_LCASE","Ss Fi Ffi Ff St
\u0399\u0308\u0342");
- assertInitCap("ß fi ffi ff st ῗ", "UNICODE","Ss Fi Ffi Ff St
\u0399\u0308\u0342");
- assertInitCap("ß fi ffi ff st ῗ", "UNICODE","Ss Fi Ffi Ff St
\u0399\u0308\u0342");
+ assertInitCap("ß fi ffi ff st ῗ", "UTF8_BINARY", "Ss Fi Ffi Ff St Ϊ͂", "ß fi ffi
ff st ῗ");
+ assertInitCap("ß fi ffi ff st ῗ", "UTF8_LCASE", "Ss Fi Ffi Ff St
\u0399\u0308\u0342");
+ assertInitCap("ß fi ffi ff st ῗ", "UNICODE", "Ss Fi Ffi Ff St
\u0399\u0308\u0342");
+ assertInitCap("ß fi ffi ff st ῗ", "UNICODE", "Ss Fi Ffi Ff St
\u0399\u0308\u0342");
+ assertInitCap("œ ǽ", "UTF8_BINARY", "Œ Ǽ", "Œ Ǽ");
// Different possible word boundaries.
assertInitCap("a b c", "UTF8_BINARY", "A B C");
assertInitCap("a b c", "UNICODE", "A B C");
@@ -1458,13 +1486,42 @@ public class CollationSupportSuite {
assertInitCap("džaba Ljubav NJegova", "UTF8_LCASE", "Džaba Ljubav Njegova");
assertInitCap("džaba Ljubav NJegova", "UNICODE_CI", "Džaba Ljubav Njegova");
assertInitCap("ß fi ffi ff st ΣΗΜΕΡΙΝΟΣ ΑΣΗΜΕΝΙΟΣ İOTA", "UTF8_BINARY",
- "ß fi ffi ff st Σημερινος Ασημενιος I\u0307ota");
+ "Ss Fi Ffi Ff St Σημερινος Ασημενιος İota","ß fi ffi ff st Σημερινος
Ασημενιος I\u0307ota");
assertInitCap("ß fi ffi ff st ΣΗΜΕΡΙΝΟΣ ΑΣΗΜΕΝΙΟΣ İOTA", "UTF8_LCASE",
"Ss Fi Ffi Ff St Σημερινος Ασημενιος İota");
assertInitCap("ß fi ffi ff st ΣΗΜΕΡΙΝΟΣ ΑΣΗΜΕΝΙΟΣ İOTA", "UNICODE",
"Ss Fi Ffi Ff St Σημερινος Ασημενιος İota");
- assertInitCap("ß fi ffi ff st ΣΗΜΕΡΙΝΟΣ ΑΣΗΜΕΝΙΟΣ İOTA", "UNICODE_CI",
- "Ss Fi Ffi Ff St Σημερινος Ασημενιος İota");
+ assertInitCap("ß fi ffi ff st ΣΗΜΕΡςΙΝΟΣ ΑΣΗΜΕΝΙΟΣ İOTA", "UNICODE_CI",
+ "Ss Fi Ffi Ff St Σημερςινος Ασημενιος İota");
+ // Characters that map to multiple characters when titlecased and
lowercased.
+ assertInitCap("ß fi ffi ff st İOTA", "UTF8_BINARY", "Ss Fi Ffi Ff St İota", "ß
fi ffi ff st İota");
+ assertInitCap("ß fi ffi ff st OİOTA", "UTF8_BINARY",
+ "Ss Fi Ffi Ff St Oi\u0307ota", "ß fi ffi ff st Oi̇ota");
+ // Lowercasing Greek letter sigma ('Σ') when case-ignorable character
present.
+ assertInitCap("`Σ", "UTF8_BINARY", "`σ", "`σ");
+ assertInitCap("1`Σ`` AΣ", "UTF8_BINARY", "1`σ`` Aς", "1`σ`` Aς");
+ assertInitCap("a1`Σ``", "UTF8_BINARY", "A1`σ``", "A1`σ``");
+ assertInitCap("a`Σ``", "UTF8_BINARY", "A`ς``", "A`σ``");
+ assertInitCap("a`Σ``1", "UTF8_BINARY", "A`ς``1", "A`σ``1");
+ assertInitCap("a`Σ``A", "UTF8_BINARY", "A`σ``a", "A`σ``a");
+ assertInitCap("ΘΑ�Σ�ΟΣ�", "UTF8_BINARY", "Θα�σ�ος�", "Θα�σ�ος�");
+ assertInitCap("ΘΑᵩΣ�ΟᵩΣᵩ�", "UTF8_BINARY", "Θαᵩς�οᵩςᵩ�", "Θαᵩς�οᵩςᵩ�");
+ assertInitCap("ΘΑ�ᵩΣ�ΟᵩΣᵩ�", "UTF8_BINARY", "Θα�ᵩσ�οᵩςᵩ�", "Θα�ᵩσ�οᵩςᵩ�");
+ assertInitCap("ΘΑ�ᵩΣᵩ�ΟᵩΣᵩ�", "UTF8_BINARY", "Θα�ᵩσᵩ�οᵩςᵩ�",
"Θα�ᵩσᵩ�οᵩςᵩ�");
+ assertInitCap("ΘΑ�Σ�Ο�Σ�", "UTF8_BINARY", "Θα�σ�ο�σ�", "Θα�σ�ο�σ�");
+ // Disallowed bytes and invalid sequences.
+ assertInitCap(UTF8String.fromBytes(new byte[] { (byte)0xC0, (byte)0xC1,
(byte)0xF5}).toString(),
+ "UTF8_BINARY", "���", "���");
+ assertInitCap(UTF8String.fromBytes(
+ new byte[]{(byte)0xC0, (byte)0xC1, (byte)0xF5, 0x20, 0x61, 0x41,
(byte)0xC0}).toString(),
+ "UTF8_BINARY",
+ "��� Aa�", "��� Aa�");
+ assertInitCap(UTF8String.fromBytes(new
byte[]{(byte)0xC2,(byte)0xC2}).toString(),
+ "UTF8_BINARY", "��", "��");
+ assertInitCap(UTF8String.fromBytes(
+ new byte[]{0x61, 0x41, (byte)0xC2, (byte)0xC2, 0x41}).toString(),
+ "UTF8_BINARY",
+ "Aa��a", "Aa��a");
}
/**
diff --git
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
index 2428d40fe801..c4a66fdffdd4 100644
---
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
+++
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
@@ -26,6 +26,8 @@ import java.util.*;
import com.google.common.collect.ImmutableMap;
import org.apache.spark.unsafe.Platform;
+import org.apache.spark.unsafe.UTF8StringBuilder;
+
import org.junit.jupiter.api.Test;
import static org.apache.spark.unsafe.types.UTF8String.fromString;
@@ -1362,4 +1364,27 @@ public class UTF8StringSuite {
UTF8String.fromString("111111111111111111111111111111111111111111111111111111111111111"),
UTF8String.toBinaryString(Long.MAX_VALUE));
}
+
+ /**
+ * This tests whether appending a codepoint to a 'UTF8StringBuilder'
correctly appends every
+ * single codepoint. We test it against an already existing
'StringBuilder.appendCodePoint' and
+ * 'UTF8String.fromString'. We skip testing the surrogate codepoints because
at some point while
+ * converting the surrogate codepoint to 'UTF8String' (via 'StringBuilder'
and 'UTF8String') we
+ * get an ill-formated byte sequence (probably because 'String' is in UTF-16
format, and a single
+ * surrogate codepoint is handled differently in UTF-16 than in UTF-8, so
somewhere during those
+ * conversions some different behaviour happens).
+ */
+ @Test
+ public void testAppendCodepointToUTF8StringBuilder() {
+ int surrogateRangeLowerBound = 0xD800;
+ int surrogateRangeUpperBound = 0xDFFF;
+ for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i)
{
+ if(surrogateRangeLowerBound <= i && i <= surrogateRangeUpperBound)
continue;
+ UTF8StringBuilder usb = new UTF8StringBuilder();
+ usb.appendCodePoint(i);
+ StringBuilder sb = new StringBuilder();
+ sb.appendCodePoint(i);
+ assert(usb.build().equals(UTF8String.fromString(sb.toString())));
+ }
+ }
}