scovich commented on code in PR #8438:
URL: https://github.com/apache/arrow-rs/pull/8438#discussion_r2376605061
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -353,37 +352,20 @@ impl VariantArray {
/// Note: Does not do deep validation of the [`Variant`], so it is up to
the
/// caller to ensure that the metadata and value were constructed
correctly.
pub fn value(&self, index: usize) -> Variant<'_, '_> {
- match &self.shredding_state {
Review Comment:
There was already substantial logic duplication among the different match
arms, and it only got worse once `typed_value_to_variant` started requiring the
value column (needed for both error checking now, and later when handling
partially shredded objects). It turned out that directly referencing the two
fields was a _lot_ simpler.
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -809,17 +800,11 @@ fn typed_value_to_variant(typed_value: &ArrayRef, index:
usize) -> Variant<'_, '
let date = Date32Type::to_naive_date(value);
Variant::from(date)
}
- DataType::FixedSizeBinary(binary_len) => {
+ // 16-byte FixedSizeBinary is always UUID; all others illegal.
+ DataType::FixedSizeBinary(16) => {
let array = typed_value.as_fixed_size_binary();
- // Try to treat 16 byte FixedSizeBinary as UUID
- let value = array.value(index);
- if *binary_len == 16 {
- if let Ok(uuid) = Uuid::from_slice(value) {
- return Variant::from(uuid);
- }
- }
let value = array.value(index);
- Variant::from(value)
+ Uuid::from_slice(value).map_or(Variant::Null, Variant::from)
Review Comment:
Falling back to `Variant::Null` is not really correct, but follows much of
the existing convention in this code.
If/when we make `VariantArray::value` fallible, this code should propagate
the error on failure.
Alternatively, we could just `unwrap`, "knowing" that
[Uuid::from_slice](https://docs.rs/uuid/latest/uuid/struct.Uuid.html#method.from_slice)
can "only" fail if given a wrong slice size (which in theory should never
happen because we know we have an array of 16-byte fixed binary values)?
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -796,8 +778,17 @@ impl StructArrayBuilder {
}
/// returns the non-null element at index as a Variant
-fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant<'_,
'_> {
- match typed_value.data_type() {
+fn typed_value_to_variant<'a>(
+ typed_value: &'a ArrayRef,
+ value: Option<&BinaryViewArray>,
+ index: usize,
+) -> Variant<'a, 'a> {
+ let data_type = typed_value.data_type();
+ if value.is_some_and(|v| !matches!(data_type, DataType::Struct(_)) &&
v.is_valid(index)) {
+ // Only a partially shredded struct is allowed to have values for both
columns
+ panic!("Invalid variant, conflicting value and typed_value");
Review Comment:
This whole panic thing is becoming increasingly awkward as more and more
valid error cases arise. Especially because:
1. Variant data is untrusted (coming in from the user), so we have to expect
malformed data
2. All prod uses of `VariantArray::value` are in fallible code that _could_
return an error, if given the opportunity.
Now that `VariantArray` no longer implements `Array`, we have the option to
make `value` fallible (or add a fallible `try_value` if we _really_ want to
keep the panicky version).
##########
parquet/tests/variant_integration.rs:
##########
@@ -197,6 +194,7 @@ variant_test_case!(84, "Unsupported typed_value type:
Struct(");
variant_test_case!(85, "Unsupported typed_value type: List(");
variant_test_case!(86, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
+// TODO: Once structs are supported, expect "Invalid variant, non-object value
with shredded fields"
variant_test_case!(87, "Unsupported typed_value type: Struct(");
Review Comment:
This is an invalid-case test, but the lack of struct support currently masks
the real problem.
(another below)
##########
parquet/tests/variant_integration.rs:
##########
@@ -144,10 +144,7 @@ variant_test_case!(39);
variant_test_case!(40, "Unsupported typed_value type: List(");
variant_test_case!(41, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
- 42,
- "Expected an error 'Invalid variant, conflicting value and typed_value`,
but got no error"
Review Comment:
This was just flat out wrong, swallowing the error message that correctly
identifies a problem 🤦
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -898,6 +871,14 @@ fn cast_to_binary_view_arrays(array: &dyn Array) ->
Result<ArrayRef, ArrowError>
/// replaces all instances of Binary with BinaryView in a DataType
fn rewrite_to_view_types(data_type: &DataType) -> DataType {
Review Comment:
If we agree this is the right place for the checks, I should probably rename
the function (and make it fallible)?
And also expand it to cover the exhaustive set of valid and invalid data
types so there's no confusion about what's legal and what's forbidden. This can
be done immediately, even if a given "valid" data type isn't yet supported --
the read will simply fail later on in such cases (exactly the same as already
happens today).
##########
parquet/tests/variant_integration.rs:
##########
@@ -238,21 +236,19 @@ variant_test_case!(124);
variant_test_case!(125, "Unsupported typed_value type: Struct");
variant_test_case!(126, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
- 127,
- "Invalid variant data: InvalidArgumentError(\"Received empty bytes\")"
Review Comment:
AFAICT, the test data has invalid empty `""` metadata column entries,
perhaps because the data is manually generated and the author never expected
readers to get beyond the schema checks 🤷
##########
parquet/tests/variant_integration.rs:
##########
@@ -238,21 +236,19 @@ variant_test_case!(124);
variant_test_case!(125, "Unsupported typed_value type: Struct");
variant_test_case!(126, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
- 127,
- "Invalid variant data: InvalidArgumentError(\"Received empty bytes\")"
-);
+variant_test_case!(127, "Illegal shredded value type: UInt32");
// Is an error case (should be failing as the expected error message indicates)
+// TODO: Once structs are supported, expect "Invalid variant, non-object value
with shredded fields"
variant_test_case!(128, "Unsupported typed_value type: Struct(");
-variant_test_case!(129, "Invalid variant data: InvalidArgumentError(");
Review Comment:
This test verifies an invalid input (value and typed_value both NULL), which
the [shredding
spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding)
mandates should produce `Variant::Null`:
> If a Variant is missing in a context where a value is required, readers
must return a Variant null (00): basic type 0 (primitive) and physical type 0
(null).
The parquet footer for this test is:
```
Row group 0: count: 1 123.00 B records start: 4 total(compressed): 123 B
total(uncompressed):123 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
id INT32 _ _ 1 27.00 B 0 "1" / "1"
var.metadata BINARY _ _ 1 36.00 B 0 "0x010000"
/ "0x010000"
var.value BINARY _ _ 1 30.00 B 1
var.typed_value INT32 _ _ 1 30.00 B 1
```
##########
parquet/tests/variant_integration.rs:
##########
@@ -238,21 +236,19 @@ variant_test_case!(124);
variant_test_case!(125, "Unsupported typed_value type: Struct");
variant_test_case!(126, "Unsupported typed_value type: List(");
// Is an error case (should be failing as the expected error message indicates)
-variant_test_case!(
- 127,
- "Invalid variant data: InvalidArgumentError(\"Received empty bytes\")"
-);
+variant_test_case!(127, "Illegal shredded value type: UInt32");
// Is an error case (should be failing as the expected error message indicates)
+// TODO: Once structs are supported, expect "Invalid variant, non-object value
with shredded fields"
variant_test_case!(128, "Unsupported typed_value type: Struct(");
-variant_test_case!(129, "Invalid variant data: InvalidArgumentError(");
+variant_test_case!(129);
variant_test_case!(130, "Unsupported typed_value type: Struct(");
variant_test_case!(131);
variant_test_case!(132, "Unsupported typed_value type: Struct(");
variant_test_case!(133, "Unsupported typed_value type: Struct(");
variant_test_case!(134, "Unsupported typed_value type: Struct(");
variant_test_case!(135);
variant_test_case!(136, "Unsupported typed_value type: List(");
-variant_test_case!(137, "Invalid variant data: InvalidArgumentError(");
Review Comment:
I'm not sure why this one failed before, but it was apparently for the wrong
reason.
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -438,26 +438,6 @@ mod test {
numeric_partially_shredded_test!(i64,
partially_shredded_int64_variant_array);
}
- #[test]
- fn get_variant_partially_shredded_uint8_as_variant() {
Review Comment:
I'm not sure how exhaustive we want to be about negative testing as a
replacement for all these unit tests I deleted?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]