alamb commented on code in PR #1529:
URL: https://github.com/apache/arrow-rs/pull/1529#discussion_r846762535
##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
Ok(make_array(data))
}
-/// Returns an ArrayRef with a substring starting from `start` and with
optional length `length` of each of the elements in `array`.
-/// `start` can be negative, in which case the start counts from the end of
the string.
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of
the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Warning
+///
+/// This function **might** return in invalid utf-8 format if strings contain
chars whose codepoint > `0x007F`.
+/// ## Example of getting an invalid substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
Review Comment:
@HaoYang670 I think these are all good ideas (making safe / unsafe
versions, and adding substring for binary etc). I would be happy to create
tickets for them if that would help
If we are going to change the substring kernel, I suggest we do the
following to follow Rust's safe-by-defauly mantra:
| name | Safety | Performance | Notes |
|--------------|-----------|------------|---------|
| unsafe substring_bytes_unchecked | unsafe | fast| what is currently called
`substring` - user should make sure valid utf8 substrings can be gotten with
`start` and `length` | fast |
| substring_bytes | safe | slow | Implement substring by bytes -
throws error of invalid utf-8 sequence is created
| substring | safe | slow | Implement substring by char,
ensuring all substrings are utf-8
##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
Ok(make_array(data))
}
-/// Returns an ArrayRef with a substring starting from `start` and with
optional length `length` of each of the elements in `array`.
-/// `start` can be negative, in which case the start counts from the end of
the string.
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of
the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Warning
+///
+/// This function **might** return in invalid utf-8 format if strings contain
chars whose codepoint > `0x007F`.
+/// ## Example of getting an invalid substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
Review Comment:
@HaoYang670 I think these are all good ideas (making safe / unsafe
versions, and adding substring for binary etc). I would be happy to create
tickets for them if that would help
If we are going to change the substring kernel, I suggest we do the
following to follow Rust's safe-by-default mantra:
| name | Safety | Performance | Notes |
|--------------|-----------|------------|---------|
| unsafe substring_bytes_unchecked | unsafe | fast| what is currently called
`substring` - user should make sure valid utf8 substrings can be gotten with
`start` and `length` | fast |
| substring_bytes | safe | slow | Implement substring by bytes -
throws error of invalid utf-8 sequence is created
| substring | safe | slow | Implement substring by char,
ensuring all substrings are utf-8
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]