srowen commented on a change in pull request #31164:
URL: https://github.com/apache/spark/pull/31164#discussion_r556975135
##########
File path:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
##########
@@ -1065,16 +1065,19 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
return buf.build();
}
- // TODO: Need to use `Code Point` here instead of Char in case the character longer than 2 bytes
- public UTF8String translate(Map<Character, Character> dict) {
+ public UTF8String translate(Map<String, String> dict) {
String srcStr = this.toString();
StringBuilder sb = new StringBuilder();
- for(int k = 0; k< srcStr.length(); k++) {
- if (null == dict.get(srcStr.charAt(k))) {
- sb.append(srcStr.charAt(k));
- } else if ('\0' != dict.get(srcStr.charAt(k))){
- sb.append(dict.get(srcStr.charAt(k)));
+ int charCount = 0;
+ for(int k = 0; k < srcStr.length(); k += charCount) {
+ int codePoint = srcStr.codePointAt(k);
+ charCount = Character.charCount(codePoint);
+ String subStr = srcStr.substring(k, k + charCount);
+ if (null == dict.get(subStr)) {
+ sb.append(subStr);
+ } else if (!"\0".equals(dict.get(subStr))) {
Review comment:
While we're here, we could probably optimize away the two calls to
dict.get(subStr) by storing the result in a local variable and reusing it.
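For illustration, here is a minimal sketch of what the single-lookup version might look like. It is a standalone helper, not the actual UTF8String method: the class name TranslateSketch, the static signature, and the local variable name translated are all hypothetical, and the final append branch is inferred from the removed lines in the hunk above, since the hunk is cut off at the commented line.

```java
import java.util.HashMap;
import java.util.Map;

public class TranslateSketch {
    // Hypothetical standalone helper mirroring the loop in the diff;
    // it operates on a plain String rather than on UTF8String.
    static String translate(String srcStr, Map<String, String> dict) {
        StringBuilder sb = new StringBuilder();
        int charCount = 0;
        for (int k = 0; k < srcStr.length(); k += charCount) {
            int codePoint = srcStr.codePointAt(k);
            // 1 for BMP characters, 2 for surrogate pairs
            charCount = Character.charCount(codePoint);
            String subStr = srcStr.substring(k, k + charCount);
            // Look up once and reuse the result in both branches.
            String translated = dict.get(subStr);
            if (translated == null) {
                // No mapping: keep the original character.
                sb.append(subStr);
            } else if (!"\0".equals(translated)) {
                // Mapped to something other than "\0": emit the replacement.
                // (Assumed branch; "\0" means "drop this character".)
                sb.append(translated);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("a", "x");   // replace a with x
        dict.put("b", "\0");  // drop b
        System.out.println(translate("abc", dict)); // prints "xc"
    }
}
```

Besides saving a hash lookup per character, holding the value in a local makes it clear that both branches test the same mapping.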
##########
File path:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
##########
@@ -1065,16 +1065,19 @@ public UTF8String replace(UTF8String search, UTF8String replace) {
return buf.build();
}
- // TODO: Need to use `Code Point` here instead of Char in case the character longer than 2 bytes
- public UTF8String translate(Map<Character, Character> dict) {
+ public UTF8String translate(Map<String, String> dict) {
String srcStr = this.toString();
StringBuilder sb = new StringBuilder();
- for(int k = 0; k< srcStr.length(); k++) {
- if (null == dict.get(srcStr.charAt(k))) {
- sb.append(srcStr.charAt(k));
- } else if ('\0' != dict.get(srcStr.charAt(k))){
- sb.append(dict.get(srcStr.charAt(k)));
+ int charCount = 0;
+ for(int k = 0; k < srcStr.length(); k += charCount) {
Review comment:
Nit: add a space after `for`, i.e. `for (` rather than `for(`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.