tustvold commented on code in PR #1832:
URL: https://github.com/apache/arrow-rs/pull/1832#discussion_r894994725


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -182,24 +182,66 @@ pub fn substring_by_char<OffsetSize: OffsetSizeTrait>(
     start: i64,
     length: Option<u64>,
 ) -> Result<GenericStringArray<OffsetSize>> {
-    Ok(array
-        .iter()
-        .map(|val| {
-            val.map(|val| {
-                let char_count = val.chars().count();
-                let start = if start >= 0 {
-                    start.to_usize().unwrap().min(char_count)
-                } else {
-                    char_count - (-start).to_usize().unwrap().min(char_count)
-                };
-                let length = length.map_or(char_count - start, |length| {
-                    length.to_usize().unwrap().min(char_count - start)
-                });
+    let mut vals = BufferBuilder::<u8>::new(array.value_data().len());

Review Comment:
   Sorry I wasn't clear, the problem here is that the values buffer may be 
substantially larger necessary as a result of slicing.
   
   Say I have an array of 1000 strings, and I take a slice of length two, the 
values array is completely unchanged, but the offsets array is now length of 3, 
instead of 1001.
   
   By taking the delta between the first and last offset in the source array, 
you ensure you are now allocating capacity that you definitely don't need, even 
if length was usize::MAX or something.



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -182,24 +182,66 @@ pub fn substring_by_char<OffsetSize: OffsetSizeTrait>(
     start: i64,
     length: Option<u64>,
 ) -> Result<GenericStringArray<OffsetSize>> {
-    Ok(array
-        .iter()
-        .map(|val| {
-            val.map(|val| {
-                let char_count = val.chars().count();
-                let start = if start >= 0 {
-                    start.to_usize().unwrap().min(char_count)
-                } else {
-                    char_count - (-start).to_usize().unwrap().min(char_count)
-                };
-                let length = length.map_or(char_count - start, |length| {
-                    length.to_usize().unwrap().min(char_count - start)
-                });
+    let mut vals = BufferBuilder::<u8>::new(array.value_data().len());

Review Comment:
   Sorry I wasn't clear, the problem here is that the values buffer may be 
substantially larger than necessary as a result of slicing.
   
   Say I have an array of 1000 strings, and I take a slice of length two, the 
values array is completely unchanged, but the offsets array is now length of 3, 
instead of 1001.
   
   By taking the delta between the first and last offset in the source array, 
you ensure you are now allocating capacity that you definitely don't need, even 
if length was usize::MAX or something.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to