bkirwi commented on code in PR #6096:
URL: https://github.com/apache/arrow-rs/pull/6096#discussion_r1711661836
##########
arrow-row/src/lib.rs:
##########
@@ -738,6 +738,23 @@ impl RowConverter {
}
}
+ /// Create a new [Rows] instance from the given binary data.
+ pub fn from_binary(&self, array: BinaryArray) -> Rows {
Review Comment:
Yeah, that's my understanding! This would definitely be unsafe without the
`validate_utf8` below... but with it, I believe this has the same safety
properties as the existing public API.
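(The validate-then-trust pattern being described can be sketched in plain std Rust, without the arrow-rs types: one up-front O(n) validation pass upgrades untrusted bytes into a type whose invariant later code can rely on, analogous to `from_binary` running `validate_utf8` before the row decoder reuses the bytes. The function name here is illustrative, not arrow-rs API:)

```rust
// Illustrative stand-in: String::from_utf8 validates once; the returned
// String then guarantees UTF-8 for all later (possibly unchecked) uses,
// just as validate_utf8 establishes the invariant from_binary relies on.
fn to_trusted(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}
```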
##########
arrow-row/src/lib.rs:
##########
@@ -868,6 +885,24 @@ impl Rows {
+ self.buffer.len()
+ self.offsets.len() * std::mem::size_of::<usize>()
}
+
+ /// Create a [BinaryArray] from the [Rows] data without reallocating the
+ /// underlying bytes.
+ pub fn into_binary(self) -> BinaryArray {
+ assert!(
+ self.buffer.len() <= i32::MAX as usize,
+ "rows buffer too large"
+ );
+ let offsets_scalar =
+     ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));
Review Comment:
My belief is that this is guaranteed by the assert above (which asserts that
the len is not larger than `i32::MAX`) and the existing offset invariant (which
guarantees that all offsets are valid indices into the binary data). So a more
expensive O(n) check seemed redundant.
I'll go ahead and turn that assert into a `Result::Err`; let me know what
you think about the other side of it!
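(The assert-to-`Result` change could look roughly like this, as a self-contained sketch with a plain `Vec<usize>` standing in for the `Rows` offsets; the function name and error type are hypothetical, not the actual arrow-rs signature:)

```rust
// Guard once on the buffer length; the existing offset invariant
// (every offset is a valid index into the buffer, so offset <= buffer_len)
// then makes each per-offset cast infallible, avoiding an O(n) check.
fn offsets_to_i32(buffer_len: usize, offsets: &[usize]) -> Result<Vec<i32>, String> {
    if buffer_len > i32::MAX as usize {
        return Err(format!("rows buffer too large: {buffer_len} bytes"));
    }
    Ok(offsets.iter().map(|&o| o as i32).collect())
}
```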
##########
arrow-row/src/lib.rs:
##########
@@ -738,6 +738,23 @@ impl RowConverter {
}
}
+ /// Create a new [Rows] instance from the given binary data.
Review Comment:
Sure, done.
##########
arrow-row/src/lib.rs:
##########
@@ -2304,6 +2344,24 @@ mod tests {
actual.to_data().validate_full().unwrap();
dictionary_eq(actual, expected)
}
+
+ // Check that we can round-trip through the binary representation
+ let rows = rows.into_binary();
+ let parser = converter.parser();
+ let back = converter
+ .convert_rows(rows.iter().map(|b| parser.parse(b.expect("valid bytes"))))
+ .unwrap();
+ for (actual, expected) in back.iter().zip(&arrays) {
+ actual.to_data().validate_full().unwrap();
+ dictionary_eq(actual, expected)
+ }
+
+ let rows = converter.from_binary(rows);
+ let back = converter.convert_rows(&rows).unwrap();
+ for (actual, expected) in back.iter().zip(&arrays) {
+ actual.to_data().validate_full().unwrap();
+ dictionary_eq(actual, expected)
+ }
Review Comment:
Sure! I notice there's just one existing test for the parser, for utf8 data;
I've matched that and added a couple more tests for interesting cases.
(This seems like great API surface to fuzz... but it's challenging to write
a real fuzzer for, since panics are expected and miri is disabled for our
existing fuzzer. May be interesting future work!)
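(One way around the "panics are expected" obstacle, sketched in std-only Rust: a fuzz target can wrap the parse in `catch_unwind` so a panic on malformed input becomes a rejected case rather than a crash. The `try_parse` body here is a hypothetical stand-in for `parser.parse`, not the arrow-rs fuzzer:)

```rust
use std::panic;

// Hypothetical fuzz-target shape: catch_unwind converts an expected
// panic on malformed input into None, so the harness only flags
// genuine bugs (e.g. UB caught by sanitizers), not routine rejections.
fn try_parse(bytes: &[u8]) -> Option<usize> {
    panic::catch_unwind(|| {
        // stand-in for parser.parse(): panics on malformed input
        assert!(!bytes.is_empty(), "unexpected empty input");
        bytes.len()
    })
    .ok()
}
```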
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]