HaoYang670 commented on code in PR #1529:
URL: https://github.com/apache/arrow-rs/pull/1529#discussion_r846528223


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     Ok(make_array(data))
 }
 
-/// Returns an ArrayRef with a substring starting from `start` and with 
optional length `length` of each of the elements in `array`.
-/// `start` can be negative, in which case the start counts from the end of 
the string.
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of 
the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Warning
+///
+/// This function **might** return in invalid utf-8 format if strings contain 
chars whose codepoint > `0x007F`.
+/// ## Example of getting an invalid substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format

Review Comment:
   By the way, is it necessary to mark `substring` unsafe? 
   Besides, we could implement a safe version of `substring` and also 
`substring_by_char`:
   | name         | Safety     | Performance |
   |--------------|-----------|------------|
   | unsafe substring_unckecked | user should make sure valid utf8 substrings 
can be gotten with `start` and `length`       | fast        |
   | substring_checked      | None  | slow       |
   | substring_by_char      | None  | slow       |
   
   **Users should always make sure all values of the input array are in valid 
utf-8 format.**
   
   
   This will improve the safety of the `substring` kernel, but break the public 
API. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to