Re: [PR] feat: support reading and writing`StringView` and `BinaryView` in parquet [arrow-rs]

via GitHub Tue, 02 Apr 2024 13:43:36 -0700


alamb commented on code in PR #5557:
URL: https://github.com/apache/arrow-rs/pull/5557#discussion_r1548564637



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -400,7 +412,23 @@ where
 /// ```
 pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
 
-/// A [`GenericByteViewArray`] that stores uf8 data
+impl BinaryViewArray {
+    /// Convert the [`BinaryViewArray`] to [`StringViewArray`]
+    /// If items not utf8 data, validate will fail and error returned.
+    pub fn to_stringview(self) -> Result<StringViewArray, ArrowError> {

Review Comment:
   i wonder if `to_string_view` would be more consistent with snake case for 
function names (likewise for `to_string_view_unchecked`



##########
arrow/src/util/bench_util.rs:
##########
@@ -119,6 +119,78 @@ pub fn create_string_array_with_len<Offset: 
OffsetSizeTrait>(
         .collect()
 }
 
+/// Creates a random (but fixed-seeded) array of a given size, null density 
and length
+pub fn create_string_view_array_with_len(
+    size: usize,
+    null_density: f32,
+    str_len: usize,
+    mixed: bool,
+) -> StringViewArray {
+    let rng = &mut seedable_rng();
+
+    let mut lengths = Vec::with_capacity(size);
+
+    // if mixed, we creates first half that string length small than 12 bytes 
and second half large than 12 bytes
+    if mixed {
+        for _ in 0..size / 2 {
+            lengths.push(rng.gen_range(1..12));
+        }
+        for _ in size / 2..size {
+            lengths.push(rng.gen_range(12..=std::cmp::max(30, str_len)));
+        }
+    } else {
+        lengths.resize(size, str_len);
+    }
+
+    lengths
+        .into_iter()
+        .map(|len| {
+            if rng.gen::<f32>() < null_density {
+                None
+            } else {
+                let value: Vec<u8> = 
rng.sample_iter(&Alphanumeric).take(len).collect();
+                Some(String::from_utf8(value).unwrap())
+            }
+        })
+        .collect()
+}
+
+/// Creates a random (but fixed-seeded) array of a given size, null density 
and length
+pub fn create_binary_view_array_with_len(
+    size: usize,
+    null_density: f32,
+    str_len: usize,
+    mixed: bool,
+) -> BinaryViewArray {
+    let rng = &mut seedable_rng();
+
+    let mut lengths = Vec::with_capacity(size);
+
+    // if mixed, we creates first half that string length small than 12 bytes 
and second half large than 12 bytes
+    if mixed {
+        for _ in 0..size / 2 {
+            lengths.push(rng.gen_range(1..12));
+        }
+        for _ in size / 2..size {
+            lengths.push(rng.gen_range(12..=std::cmp::max(30, str_len)));
+        }
+    } else {
+        lengths.resize(size, str_len);
+    }
+
+    lengths
+        .into_iter()
+        .map(|len| {
+            if rng.gen::<f32>() < null_density {
+                None
+            } else {
+                let value: Vec<u8> = 
rng.sample_iter(&Alphanumeric).take(len).collect();

Review Comment:
   Since this seems to be copy/paste from `create_string_view_array_with_len` 
maybe we could avoid the duplication by doing 
`create_string_view_array_with_len().to_binary_view()` or something? 



##########
parquet/benches/arrow_writer.rs:
##########
@@ -96,6 +96,24 @@ fn create_string_bench_batch(
     )?)
 }
 
+fn create_string_view_bench_batch(

Review Comment:
   I didn't see that this tested both BinaryView and Utf8View at first (the 
name caught me up)
   
   ```suggestion
   fn create_string_and_binary_view_bench_batch(
   ```



##########
parquet/src/arrow/buffer/offset_buffer.rs:
##########
@@ -125,18 +128,50 @@ impl<I: OffsetSizeTrait> OffsetBuffer<I> {
 
     /// Converts this into an [`ArrayRef`] with the provided `data_type` and 
`null_buffer`
     pub fn into_array(self, null_buffer: Option<Buffer>, data_type: ArrowType) 
-> ArrayRef {
-        let array_data_builder = ArrayDataBuilder::new(data_type)
-            .len(self.len())
-            .add_buffer(Buffer::from_vec(self.offsets))
-            .add_buffer(Buffer::from_vec(self.values))
-            .null_bit_buffer(null_buffer);
-
-        let data = match cfg!(debug_assertions) {
-            true => array_data_builder.build().unwrap(),
-            false => unsafe { array_data_builder.build_unchecked() },
-        };
-
-        make_array(data)
+        match data_type {
+            ArrowType::Utf8View => {
+                let mut builder = self.build_generic_byte_view();
+                Arc::new(builder.finish().to_stringview().unwrap())
+            }
+            ArrowType::BinaryView => {
+                let mut builder = self.build_generic_byte_view();
+                Arc::new(builder.finish())
+            }
+            _ => {
+                let array_data_builder = ArrayDataBuilder::new(data_type)
+                    .len(self.len())
+                    .add_buffer(Buffer::from_vec(self.offsets))
+                    .add_buffer(Buffer::from_vec(self.values))
+                    .null_bit_buffer(null_buffer);
+
+                let data = match cfg!(debug_assertions) {
+                    true => array_data_builder.build().unwrap(),
+                    false => unsafe { array_data_builder.build_unchecked() },
+                };
+
+                make_array(data)
+            }
+        }
+    }
+
+    fn build_generic_byte_view(&self) -> 
GenericByteViewBuilder<BinaryViewType> {
+        let mut builder = 
GenericByteViewBuilder::<BinaryViewType>::with_capacity(self.len());
+        for i in self.offsets.windows(2) {
+            let start = i[0];
+            let end = i[1];
+            let b = unsafe {
+                std::slice::from_raw_parts(
+                    self.values.as_ptr().offset(start.to_isize().unwrap()),
+                    (end - start).to_usize().unwrap(),
+                )
+            };
+            if b.is_empty() {
+                builder.append_null();
+            } else {
+                builder.append_value(b);

Review Comment:
   FWIWI think `append_value` still copies the bytes. I ran the nice new 
benchmark from this branch and it shows that pretty clearly:
   
   ![Screenshot 2024-04-02 at 4 39 45 
PM](https://github.com/apache/arrow-rs/assets/490673/64515fd7-138c-4537-a4a8-677894d40973)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: support reading and writing`StringView` and `BinaryView` in parquet [arrow-rs]

Reply via email to