bkirwi commented on code in PR #6096:
URL: https://github.com/apache/arrow-rs/pull/6096#discussion_r1711661836
##########
arrow-row/src/lib.rs:
##########
@@ -738,6 +738,23 @@ impl RowConverter {
}
}
+ /// Create a new [Rows] instance from the given binary data.
+ pub fn from_binary(&self, array: BinaryArray) -> Rows {
Review Comment:
Yeah, that's my understanding! This would definitely be unsafe without the
`validate_utf8` below... but with it, I believe this has the same safety
properties as the existing public API.
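(The validate-then-trust pattern being described can be sketched in plain std Rust, without the arrow-rs types: one up-front O(n) validation pass upgrades untrusted bytes into a type whose invariant later code can rely on, analogous to `from_binary` running `validate_utf8` before the row decoder reuses the bytes. The function name here is illustrative, not arrow-rs API:)

```rust
// Illustrative stand-in: String::from_utf8 validates once; the returned
// String then guarantees UTF-8 for all later (possibly unchecked) uses,
// just as validate_utf8 establishes the invariant from_binary relies on.
fn to_trusted(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}
```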
##########
arrow-row/src/lib.rs:
##########
@@ -868,6 +885,24 @@ impl Rows {
+ self.buffer.len()
+ self.offsets.len() * std::mem::size_of::<usize>()
}
+
+ /// Create a [BinaryArray] from the [Rows] data without reallocating the
+ /// underlying bytes.
+ pub fn into_binary(self) -> BinaryArray {
+ assert!(
+ self.buffer.len() <= i32::MAX as usize,
+ "rows buffer too large"
+ );
+ let offsets_scalar =
+     ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));
Review Comment:
My belief is that this is guaranteed by the assert above (which asserts that
the len is not larger than `i32::MAX`) and the existing offset invariant (which
guarantees that all offsets are valid indices into the binary data). So a more
expensive O(n) check seemed redundant.
I'll go ahead and turn that assert into a `Result::Err`; let me know what
you think about the other side of it!
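(The assert-to-`Result` change could look roughly like this, as a self-contained sketch with a plain `Vec<usize>` standing in for the `Rows` offsets; the function name and error type are hypothetical, not the actual arrow-rs signature:)

```rust
// Guard once on the buffer length; the existing offset invariant
// (every offset is a valid index into the buffer, so offset <= buffer_len)
// then makes each per-offset cast infallible, avoiding an O(n) check.
fn offsets_to_i32(buffer_len: usize, offsets: &[usize]) -> Result<Vec<i32>, String> {
    if buffer_len > i32::MAX as usize {
        return Err(format!("rows buffer too large: {buffer_len} bytes"));
    }
    Ok(offsets.iter().map(|&o| o as i32).collect())
}
```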
##########
arrow-row/src/lib.rs:
##########
@@ -738,6 +738,23 @@ impl RowConverter {
}
}
+ /// Create a new [Rows] instance from the given binary data.
Review Comment:
Sure, done.
##########
arrow-row/src/lib.rs:
##########
@@ -2304,6 +2344,24 @@ mod tests {
actual.to_data().validate_full().unwrap();
dictionary_eq(actual, expected)
}
+
+ // Check that we can round-trip through the binary representation
+ let rows = rows.into_binary();
+ let parser = converter.parser();
+ let back = converter
+ .convert_rows(rows.iter().map(|b| parser.parse(b.expect("valid bytes"))))
+ .unwrap();
+ for (actual, expected) in back.iter().zip(&arrays) {
+ actual.to_data().validate_full().unwrap();
+ dictionary_eq(actual, expected)
+ }
+
+ let rows = converter.from_binary(rows);
+ let back = converter.convert_rows(&rows).unwrap();
+ for (actual, expected) in back.iter().zip(&arrays) {
+ actual.to_data().validate_full().unwrap();
+ dictionary_eq(actual, expected)
+ }
Review Comment:
Sure! I notice there's just one existing test for the parser, for utf8 data;
I've matched that and added a couple more tests for interesting cases.
(This seems like great API surface to fuzz... but it's challenging to write
a real fuzzer for, since panics are expected and miri is disabled for our
existing fuzzer. May be interesting future work!)
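(One way around the "panics are expected" obstacle, sketched in std-only Rust: a fuzz target can wrap the parse in `catch_unwind` so a panic on malformed input becomes a rejected case rather than a crash. The `try_parse` body here is a hypothetical stand-in for `parser.parse`, not the arrow-rs fuzzer:)

```rust
use std::panic;

// Hypothetical fuzz-target shape: catch_unwind converts an expected
// panic on malformed input into None, so the harness only flags
// genuine bugs (e.g. UB caught by sanitizers), not routine rejections.
fn try_parse(bytes: &[u8]) -> Option<usize> {
    panic::catch_unwind(|| {
        // stand-in for parser.parse(): panics on malformed input
        assert!(!bytes.is_empty(), "unexpected empty input");
        bytes.len()
    })
    .ok()
}
```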
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]