lidavidm commented on a change in pull request #10317:
URL: https://github.com/apache/arrow/pull/10317#discussion_r633767010
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2317,6 +2372,21 @@ const FunctionDoc utf8_lower_doc(
"Transform input to lowercase",
("For each string in `strings`, return a lowercase version."),
{"strings"});
+const FunctionDoc ascii_reverse_doc(
+ "Reverse ASCII input",
+ ("For each ASCII string in `strings`, return a reversed version.\n\n"
+ "This function assumes the input is fully ASCII. If it may contain\n"
+ "non-ASCII characters, use \"utf8_reverse\" instead."),
+ {"strings"});
+
+const FunctionDoc utf8_reverse_doc(
+ "Reverse utf8 input",
+ ("For each utf8 string in `strings`, return a reversed version.\n\n"
+ "This function works on UTF8 codepoint level. As a consequence, if the
input \n"
+ "contains combining UTF8 character sequences/ 'graphemes', they would
also \n"
+ "be reversed."),
Review comment:
Just to reword things a bit for clarity perhaps:
> Reverse input
> For each string in `strings`, return a reversed version.
> This function operates on codepoints/UTF-8 code units, not grapheme
clusters. Hence, it will not correctly reverse grapheme clusters composed of
multiple codepoints.
As we assume strings are UTF-8 and don't need to repeat that.
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -266,6 +266,54 @@ void EnsureLookupTablesFilled() {}
#endif // ARROW_WITH_UTF8PROC
+template <typename Type>
+struct AsciiReverse : StringTransform<Type, AsciiReverse<Type>> {
+ using Base = StringTransform<Type, AsciiReverse<Type>>;
+ using offset_type = typename Base::offset_type;
+
+ bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+ uint8_t* output, offset_type* output_written) {
+ uint8_t utf8_char_found = 0;
+ for (offset_type i = 0; i < input_string_ncodeunits; i++) {
+ // if a utf8 char is found, report to utf8_char_found
+ utf8_char_found |= input[i] & 0x80;
+ output[input_string_ncodeunits - i - 1] = input[i];
+ }
+ *output_written = input_string_ncodeunits;
+ return utf8_char_found == 0;
+ }
+};
+
+/*
+ * size of a valid UTF8 can be determined by looking at leading 4 bits of BYTE1
+ * UTF8_BYTE_SIZE_LUT[0..7] --> pure ascii chars --> 1B length
+ * UTF8_BYTE_SIZE_LUT[8..11] --> internal bytes --> 1B length
+ * UTF8_BYTE_SIZE_LUT[12,13] --> 2B long UTF8 chars
+ * UTF8_BYTE_SIZE_LUT[14] --> 3B long UTF8 chars
+ * UTF8_BYTE_SIZE_LUT[15] --> 4B long UTF8 chars
+ */
+const std::array<uint8_t, 16> UTF8_BYTE_SIZE_LUT{1, 1, 1, 1, 1, 1, 1, 1,
+ 1, 1, 1, 1, 2, 2, 3, 4};
+
+template <typename Type>
+struct Utf8Reverse : StringTransform<Type, Utf8Reverse<Type>> {
+ using Base = StringTransform<Type, Utf8Reverse<Type>>;
+ using offset_type = typename Base::offset_type;
+
+ bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+ uint8_t* output, offset_type* output_written) {
+ offset_type i = 0;
+ while (i < input_string_ncodeunits) {
+ uint8_t offset = UTF8_BYTE_SIZE_LUT[input[i] >> 4];
+ offset_type stride = std::min(i + offset, input_string_ncodeunits);
+ std::copy(input + i, input + stride, output + input_string_ncodeunits -
stride);
+ i += offset;
+ }
+ *output_written = input_string_ncodeunits;
+ return true;
Review comment:
I don't think we need to explicitly validate. Note the user themselves
can call ValidateFull() which IIRC will also check for UTF-8.
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -266,6 +271,56 @@ void EnsureLookupTablesFilled() {}
#endif // ARROW_WITH_UTF8PROC
+template <typename Type>
+struct AsciiReverse : StringTransform<Type, AsciiReverse<Type>> {
+ using Base = StringTransform<Type, AsciiReverse<Type>>;
+ using offset_type = typename Base::offset_type;
+
+ bool Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+ uint8_t* output, offset_type* output_written) {
+ uint8_t utf8_char_found = 0;
+ for (offset_type i = 0; i < input_string_ncodeunits; i++) {
+ // if a utf8 char is found, report to utf8_char_found
+ utf8_char_found |= input[i] & 0x80;
+ output[input_string_ncodeunits - i - 1] = input[i];
+ }
+ *output_written = input_string_ncodeunits;
+ return utf8_char_found == 0;
+ }
+
+ static Status InvalidStatus() { return Status::Invalid("Non-ascii sequence
in input"); }
Review comment:
Nit: can we be consistent with `ASCII`?
##########
File path: docs/source/cpp/compute.rst
##########
@@ -435,47 +435,59 @@ String transforms
+==========================+============+=========================+=====================+=========+=======================================+
| ascii_lower | Unary | String-like |
String-like | \(1) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| ascii_reverse | Unary | String-like |
String-like | \(2) | |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
| ascii_upper | Unary | String-like |
String-like | \(1) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| binary_length | Unary | Binary- or String-like | Int32 or
Int64 | \(2) | |
+| binary_length | Unary | Binary- or String-like | Int32 or
Int64 | \(3) | |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring | Unary | String-like |
String-like | \(4) | :struct:`ReplaceSubstringOptions` |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring | Unary | String-like |
String-like | \(3) | :struct:`ReplaceSubstringOptions` |
+| replace_substring_regex | Unary | String-like |
String-like | \(5) | :struct:`ReplaceSubstringOptions` |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring_regex | Unary | String-like |
String-like | \(4) | :struct:`ReplaceSubstringOptions` |
+| utf8_length | Unary | String-like | Int32 or
Int64 | \(6) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_length | Unary | String-like | Int32 or
Int64 | \(5) | |
+| utf8_lower | Unary | String-like |
String-like | \(7) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_lower | Unary | String-like |
String-like | \(6) | |
+| utf8_reverse | Unary | String-like |
String-like | \(8) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_upper | Unary | String-like |
String-like | \(6) | |
+| utf8_upper | Unary | String-like |
String-like | \(7) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
* \(1) Each ASCII character in the input is converted to lowercase or
uppercase. Non-ASCII characters are left untouched.
-* \(2) Output is the physical length in bytes of each input element. Output
+* \(2) ASCII input is reversed to the output. If non-ASCII characters
+ are present, ``Invalid`` :class:`Status` will be returned.
+
+* \(3) Output is the physical length in bytes of each input element. Output
type is Int32 for Binary / String, Int64 for LargeBinary / LargeString.
-* \(3) Replace non-overlapping substrings that match to
+* \(4) Replace non-overlapping substrings that match to
:member:`ReplaceSubstringOptions::pattern` by
:member:`ReplaceSubstringOptions::replacement`. If
:member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
maximum number of replacements made, counting from the left.
-* \(4) Replace non-overlapping substrings that match to the regular expression
+* \(5) Replace non-overlapping substrings that match to the regular expression
:member:`ReplaceSubstringOptions::pattern` by
:member:`ReplaceSubstringOptions::replacement`, using the Google RE2
library. If
:member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
maximum number of replacements made, counting from the left. Note that if the
pattern contains groups, backreferencing can be used.
-* \(5) Output is the number of characters (not bytes) of each input element.
+* \(6) Output is the number of characters (not bytes) of each input element.
Output type is Int32 for String, Int64 for LargeString.
-* \(6) Each UTF8-encoded character in the input is converted to lowercase or
+* \(7) Each UTF8-encoded character in the input is converted to lowercase or
uppercase.
+* \(8) UTF8-encoded input is reversed to the output. If malformed-UTF8
characters
+ are present, the Output would be undefined (but the size of Output buffers
would
+ be preserved). If combined UTF8 characters are available (ex: graphems,
glyphs),
+ they would also be reversed.
+
Review comment:
```suggestion
* \(8) Each UTF8-encoded code unit is written in reverse order to the output.
If the input is not valid UTF8, then the output is undefined (but the size
of output
buffers will be preserved).
```
##########
File path: docs/source/cpp/compute.rst
##########
@@ -435,47 +435,59 @@ String transforms
+==========================+============+=========================+=====================+=========+=======================================+
| ascii_lower | Unary | String-like |
String-like | \(1) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| ascii_reverse | Unary | String-like |
String-like | \(2) | |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
| ascii_upper | Unary | String-like |
String-like | \(1) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| binary_length | Unary | Binary- or String-like | Int32 or
Int64 | \(2) | |
+| binary_length | Unary | Binary- or String-like | Int32 or
Int64 | \(3) | |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring | Unary | String-like |
String-like | \(4) | :struct:`ReplaceSubstringOptions` |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring | Unary | String-like |
String-like | \(3) | :struct:`ReplaceSubstringOptions` |
+| replace_substring_regex | Unary | String-like |
String-like | \(5) | :struct:`ReplaceSubstringOptions` |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| replace_substring_regex | Unary | String-like |
String-like | \(4) | :struct:`ReplaceSubstringOptions` |
+| utf8_length | Unary | String-like | Int32 or
Int64 | \(6) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_length | Unary | String-like | Int32 or
Int64 | \(5) | |
+| utf8_lower | Unary | String-like |
String-like | \(7) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_lower | Unary | String-like |
String-like | \(6) | |
+| utf8_reverse | Unary | String-like |
String-like | \(8) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
-| utf8_upper | Unary | String-like |
String-like | \(6) | |
+| utf8_upper | Unary | String-like |
String-like | \(7) | |
+--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
* \(1) Each ASCII character in the input is converted to lowercase or
uppercase. Non-ASCII characters are left untouched.
-* \(2) Output is the physical length in bytes of each input element. Output
+* \(2) ASCII input is reversed to the output. If non-ASCII characters
+ are present, ``Invalid`` :class:`Status` will be returned.
+
+* \(3) Output is the physical length in bytes of each input element. Output
type is Int32 for Binary / String, Int64 for LargeBinary / LargeString.
-* \(3) Replace non-overlapping substrings that match to
+* \(4) Replace non-overlapping substrings that match to
:member:`ReplaceSubstringOptions::pattern` by
:member:`ReplaceSubstringOptions::replacement`. If
:member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
maximum number of replacements made, counting from the left.
-* \(4) Replace non-overlapping substrings that match to the regular expression
+* \(5) Replace non-overlapping substrings that match to the regular expression
:member:`ReplaceSubstringOptions::pattern` by
:member:`ReplaceSubstringOptions::replacement`, using the Google RE2
library. If
:member:`ReplaceSubstringOptions::max_replacements` != -1, it determines the
maximum number of replacements made, counting from the left. Note that if the
pattern contains groups, backreferencing can be used.
-* \(5) Output is the number of characters (not bytes) of each input element.
+* \(6) Output is the number of characters (not bytes) of each input element.
Output type is Int32 for String, Int64 for LargeString.
-* \(6) Each UTF8-encoded character in the input is converted to lowercase or
+* \(7) Each UTF8-encoded character in the input is converted to lowercase or
uppercase.
+* \(8) UTF8-encoded input is reversed to the output. If malformed-UTF8
characters
+ are present, the Output would be undefined (but the size of Output buffers
would
+ be preserved). If combined UTF8 characters are available (ex: graphems,
glyphs),
+ they would also be reversed.
+
Review comment:
I don't think we need to try to explain the subtleties of grapheme
clusters, just note that this operates on codepoints/code units, not grapheme
clusters. Proper Unicode handling could fill an entire book.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]