[GitHub] [arrow-rs] alamb commented on a diff in pull request #1529: update the doc of `substring`

GitBox Sun, 10 Apr 2022 04:15:49 -0700


alamb commented on code in PR #1529:
URL: https://github.com/apache/arrow-rs/pull/1529#discussion_r846762535



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     Ok(make_array(data))
 }
 
-/// Returns an ArrayRef with a substring starting from `start` and with 
optional length `length` of each of the elements in `array`.
-/// `start` can be negative, in which case the start counts from the end of 
the string.
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of 
the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Warning
+///
+/// This function **might** return in invalid utf-8 format if strings contain 
chars whose codepoint > `0x007F`.
+/// ## Example of getting an invalid substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format

Review Comment:
   @HaoYang670  I think these are all good ideas (making safe / unsafe 
versions, and adding substring for binary etc). I would be happy to create 
tickets for them if that would help
   
   
   If we are going to change the substring kernel, I  suggest we do the 
following to follow Rust's safe-by-defauly mantra:
   
   | name         | Safety     | Performance | Notes |
   |--------------|-----------|------------|---------|
   | unsafe substring_bytes_unchecked | unsafe | fast| what is currently called 
`substring` - user should make sure valid utf8 substrings can be gotten with 
`start` and `length`       | fast        |
   | substring_bytes    | safe  | slow       | Implement substring by bytes - 
throws error of invalid utf-8 sequence is created 
   | substring      | safe  | slow       | Implement substring by char, 
ensuring all substrings are utf-8



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     Ok(make_array(data))
 }
 
-/// Returns an ArrayRef with a substring starting from `start` and with 
optional length `length` of each of the elements in `array`.
-/// `start` can be negative, in which case the start counts from the end of 
the string.
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of 
the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Warning
+///
+/// This function **might** return in invalid utf-8 format if strings contain 
chars whose codepoint > `0x007F`.
+/// ## Example of getting an invalid substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format

Review Comment:
   @HaoYang670  I think these are all good ideas (making safe / unsafe 
versions, and adding substring for binary etc). I would be happy to create 
tickets for them if that would help
   
   
   If we are going to change the substring kernel, I  suggest we do the 
following to follow Rust's safe-by-default mantra:
   
   | name         | Safety     | Performance | Notes |
   |--------------|-----------|------------|---------|
   | unsafe substring_bytes_unchecked | unsafe | fast| what is currently called 
`substring` - user should make sure valid utf8 substrings can be gotten with 
`start` and `length`       | fast        |
   | substring_bytes    | safe  | slow       | Implement substring by bytes - 
throws error of invalid utf-8 sequence is created 
   | substring      | safe  | slow       | Implement substring by char, 
ensuring all substrings are utf-8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] alamb commented on a diff in pull request #1529: update the doc of `substring`

Reply via email to