[GitHub] [arrow] lidavidm commented on a change in pull request #10317: ARROW-12713 [C++] String reverse kernel

GitBox Mon, 17 May 2021 11:43:28 -0700


lidavidm commented on a change in pull request #10317:
URL: https://github.com/apache/arrow/pull/10317#discussion_r633767010




##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2317,6 +2372,21 @@ const FunctionDoc utf8_lower_doc(
     "Transform input to lowercase",
     ("For each string in `strings`, return a lowercase version."), 
{"strings"});
 
+const FunctionDoc ascii_reverse_doc(
+    "Reverse ASCII input",
+    ("For each ASCII string in `strings`, return a reversed version.\n\n"
+     "This function assumes the input is fully ASCII.  If it may contain\n"
+     "non-ASCII characters, use \"utf8_reverse\" instead."),
+    {"strings"});
+
+const FunctionDoc utf8_reverse_doc(
+    "Reverse utf8 input",
+    ("For each utf8 string in `strings`, return a reversed version.\n\n"
+     "This function works on UTF8 codepoint level. As a consequence, if the 
input \n"
+     "contains combining UTF8 character sequences/ 'graphemes', they would 
also \n"
+     "be reversed."),

Review comment:
       Just to reword things a bit for clarity perhaps:
   
   > Reverse input
   > For each string in `strings`, return a reversed version.
   > This function operates on codepoints/UTF-8 code units, not grapheme 
clusters. Hence, it will not correctly reverse grapheme clusters composed of 
multiple codepoints.
   
   As we assume strings are UTF-8 and don't need to repeat that.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -266,6 +266,54 @@ void EnsureLookupTablesFilled() {}
 
 #endif  // ARROW_WITH_UTF8PROC
 
+template <typename Type>
+struct AsciiReverse : StringTransform<Type, AsciiReverse<Type>> {
+  using Base = StringTransform<Type, AsciiReverse<Type>>;
+  using offset_type = typename Base::offset_type;
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    uint8_t utf8_char_found = 0;
+    for (offset_type i = 0; i < input_string_ncodeunits; i++) {
+      // if a utf8 char is found, report to utf8_char_found
+      utf8_char_found |= input[i] & 0x80;
+      output[input_string_ncodeunits - i - 1] = input[i];
+    }
+    *output_written = input_string_ncodeunits;
+    return utf8_char_found == 0;
+  }
+};
+
+/*
+ * size of a valid UTF8 can be determined by looking at leading 4 bits of BYTE1
+ * UTF8_BYTE_SIZE_LUT[0..7] --> pure ascii chars --> 1B length
+ * UTF8_BYTE_SIZE_LUT[8..11] --> internal bytes --> 1B length
+ * UTF8_BYTE_SIZE_LUT[12,13] --> 2B long UTF8 chars
+ * UTF8_BYTE_SIZE_LUT[14] --> 3B long UTF8 chars
+ * UTF8_BYTE_SIZE_LUT[15] --> 4B long UTF8 chars
+ */
+const std::array<uint8_t, 16> UTF8_BYTE_SIZE_LUT{1, 1, 1, 1, 1, 1, 1, 1,
+                                                 1, 1, 1, 1, 2, 2, 3, 4};
+
+template <typename Type>
+struct Utf8Reverse : StringTransform<Type, Utf8Reverse<Type>> {
+  using Base = StringTransform<Type, Utf8Reverse<Type>>;
+  using offset_type = typename Base::offset_type;
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    offset_type i = 0;
+    while (i < input_string_ncodeunits) {
+      uint8_t offset = UTF8_BYTE_SIZE_LUT[input[i] >> 4];
+      offset_type stride = std::min(i + offset, input_string_ncodeunits);
+      std::copy(input + i, input + stride, output + input_string_ncodeunits - 
stride);
+      i += offset;
+    }
+    *output_written = input_string_ncodeunits;
+    return true;

Review comment:
       I don't think we need to explicitly validate. Note the user themselves 
can call ValidateFull() which IIRC will also check for UTF-8.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -266,6 +271,56 @@ void EnsureLookupTablesFilled() {}
 
 #endif  // ARROW_WITH_UTF8PROC
 
+template <typename Type>
+struct AsciiReverse : StringTransform<Type, AsciiReverse<Type>> {
+  using Base = StringTransform<Type, AsciiReverse<Type>>;
+  using offset_type = typename Base::offset_type;
+
+  bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+                 uint8_t* output, offset_type* output_written) {
+    uint8_t utf8_char_found = 0;
+    for (offset_type i = 0; i < input_string_ncodeunits; i++) {
+      // if a utf8 char is found, report to utf8_char_found
+      utf8_char_found |= input[i] & 0x80;
+      output[input_string_ncodeunits - i - 1] = input[i];
+    }
+    *output_written = input_string_ncodeunits;
+    return utf8_char_found == 0;
+  }
+
+  static Status InvalidStatus() { return Status::Invalid("Non-ascii sequence 
in input"); }

Review comment:
       Nit: can we be consistent with `ASCII`?

##########
File path: docs/source/cpp/compute.rst
##########
@@ -435,47 +435,59 @@ String transforms
 
+==========================+============+=========================+=====================+=========+=======================================+
 | ascii_lower              | Unary      | String-like             | 
String-like         | \(1)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| ascii_reverse            | Unary      | String-like             | 
String-like         | \(2)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
 | ascii_upper              | Unary      | String-like             | 
String-like         | \(1)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| binary_length            | Unary      | Binary- or String-like  | Int32 or 
Int64      | \(2)    |                                       |
+| binary_length            | Unary      | Binary- or String-like  | Int32 or 
Int64      | \(3)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring        | Unary      | String-like             | 
String-like         | \(4)    | :struct:`ReplaceSubstringOptions`     |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring        | Unary      | String-like             | 
String-like         | \(3)    | :struct:`ReplaceSubstringOptions`     |
+| replace_substring_regex  | Unary      | String-like             | 
String-like         | \(5)    | :struct:`ReplaceSubstringOptions`     |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring_regex  | Unary      | String-like             | 
String-like         | \(4)    | :struct:`ReplaceSubstringOptions`     |
+| utf8_length              | Unary      | String-like             | Int32 or 
Int64      | \(6)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_length              | Unary      | String-like             | Int32 or 
Int64      | \(5)    |                                       |
+| utf8_lower               | Unary      | String-like             | 
String-like         | \(7)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_lower               | Unary      | String-like             | 
String-like         | \(6)    |                                       |
+| utf8_reverse             | Unary      | String-like             | 
String-like         | \(8)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_upper               | Unary      | String-like             | 
String-like         | \(6)    |                                       |
+| utf8_upper               | Unary      | String-like             | 
String-like         | \(7)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
 
 
 * \(1) Each ASCII character in the input is converted to lowercase or
   uppercase.  Non-ASCII characters are left untouched.
 
-* \(2) Output is the physical length in bytes of each input element.  Output
+* \(2) ASCII input is reversed to the output. If non-ASCII characters 
+  are present, ``Invalid`` :class:`Status` will be returned.
+
+* \(3) Output is the physical length in bytes of each input element.  Output
   type is Int32 for Binary / String, Int64 for LargeBinary / LargeString.
 
-* \(3) Replace non-overlapping substrings that match to
+* \(4) Replace non-overlapping substrings that match to
   :member:`ReplaceSubstringOptions::pattern` by
   :member:`ReplaceSubstringOptions::replacement`. If
   :member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
   maximum number of replacements made, counting from the left.
 
-* \(4) Replace non-overlapping substrings that match to the regular expression
+* \(5) Replace non-overlapping substrings that match to the regular expression
   :member:`ReplaceSubstringOptions::pattern` by
   :member:`ReplaceSubstringOptions::replacement`, using the Google RE2 
library. If
   :member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
   maximum number of replacements made, counting from the left. Note that if the
   pattern contains groups, backreferencing can be used.
 
-* \(5) Output is the number of characters (not bytes) of each input element.
+* \(6) Output is the number of characters (not bytes) of each input element.
   Output type is Int32 for String, Int64 for LargeString. 
 
-* \(6) Each UTF8-encoded character in the input is converted to lowercase or
+* \(7) Each UTF8-encoded character in the input is converted to lowercase or
   uppercase.
 
+* \(8) UTF8-encoded input is reversed to the output. If malformed-UTF8 
characters 
+  are present, the Output would be undefined (but the size of Output buffers 
would
+  be preserved). If combined UTF8 characters are available (ex: graphems, 
glyphs), 
+  they would also be reversed. 
+

Review comment:
       ```suggestion
   * \(8) Each UTF8-encoded code unit is written in reverse order to the output.
     If the input is not valid UTF8, then the output is undefined (but the size 
of output
     buffers will be preserved).
   
   ```

##########
File path: docs/source/cpp/compute.rst
##########
@@ -435,47 +435,59 @@ String transforms
 
+==========================+============+=========================+=====================+=========+=======================================+
 | ascii_lower              | Unary      | String-like             | 
String-like         | \(1)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| ascii_reverse            | Unary      | String-like             | 
String-like         | \(2)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
 | ascii_upper              | Unary      | String-like             | 
String-like         | \(1)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| binary_length            | Unary      | Binary- or String-like  | Int32 or 
Int64      | \(2)    |                                       |
+| binary_length            | Unary      | Binary- or String-like  | Int32 or 
Int64      | \(3)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring        | Unary      | String-like             | 
String-like         | \(4)    | :struct:`ReplaceSubstringOptions`     |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring        | Unary      | String-like             | 
String-like         | \(3)    | :struct:`ReplaceSubstringOptions`     |
+| replace_substring_regex  | Unary      | String-like             | 
String-like         | \(5)    | :struct:`ReplaceSubstringOptions`     |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring_regex  | Unary      | String-like             | 
String-like         | \(4)    | :struct:`ReplaceSubstringOptions`     |
+| utf8_length              | Unary      | String-like             | Int32 or 
Int64      | \(6)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_length              | Unary      | String-like             | Int32 or 
Int64      | \(5)    |                                       |
+| utf8_lower               | Unary      | String-like             | 
String-like         | \(7)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_lower               | Unary      | String-like             | 
String-like         | \(6)    |                                       |
+| utf8_reverse             | Unary      | String-like             | 
String-like         | \(8)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_upper               | Unary      | String-like             | 
String-like         | \(6)    |                                       |
+| utf8_upper               | Unary      | String-like             | 
String-like         | \(7)    |                                       |
 
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
 
 
 * \(1) Each ASCII character in the input is converted to lowercase or
   uppercase.  Non-ASCII characters are left untouched.
 
-* \(2) Output is the physical length in bytes of each input element.  Output
+* \(2) ASCII input is reversed to the output. If non-ASCII characters 
+  are present, ``Invalid`` :class:`Status` will be returned.
+
+* \(3) Output is the physical length in bytes of each input element.  Output
   type is Int32 for Binary / String, Int64 for LargeBinary / LargeString.
 
-* \(3) Replace non-overlapping substrings that match to
+* \(4) Replace non-overlapping substrings that match to
   :member:`ReplaceSubstringOptions::pattern` by
   :member:`ReplaceSubstringOptions::replacement`. If
   :member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
   maximum number of replacements made, counting from the left.
 
-* \(4) Replace non-overlapping substrings that match to the regular expression
+* \(5) Replace non-overlapping substrings that match to the regular expression
   :member:`ReplaceSubstringOptions::pattern` by
   :member:`ReplaceSubstringOptions::replacement`, using the Google RE2 
library. If
   :member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
   maximum number of replacements made, counting from the left. Note that if the
   pattern contains groups, backreferencing can be used.
 
-* \(5) Output is the number of characters (not bytes) of each input element.
+* \(6) Output is the number of characters (not bytes) of each input element.
   Output type is Int32 for String, Int64 for LargeString. 
 
-* \(6) Each UTF8-encoded character in the input is converted to lowercase or
+* \(7) Each UTF8-encoded character in the input is converted to lowercase or
   uppercase.
 
+* \(8) UTF8-encoded input is reversed to the output. If malformed-UTF8 
characters 
+  are present, the Output would be undefined (but the size of Output buffers 
would
+  be preserved). If combined UTF8 characters are available (ex: graphems, 
glyphs), 
+  they would also be reversed. 
+

Review comment:
       I don't think we need to try to explain the subtleties of grapheme 
clusters, just note that this operates on codepoints/code units, not grapheme 
clusters. Proper Unicode handling could fill an entire book.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] lidavidm commented on a change in pull request #10317: ARROW-12713 [C++] String reverse kernel

Reply via email to