tustvold commented on a change in pull request #1444:
URL: https://github.com/apache/arrow-rs/pull/1444#discussion_r834509184
##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
let child = &self.child_data[0];
self.validate_offsets_full::<i64>(child.len + child.offset)?;
}
- DataType::Union(_, _) => {
- // Validate Union Array as part of implementing new Union semantics
- // See comments in `ArrayData::validate()`
- // https://github.com/apache/arrow-rs/issues/85
- //
- // TODO file follow on ticket for full union validation
+ DataType::Union(_fields, mode) => {
+ match mode {
+ UnionMode::Sparse => {
+ // typeids should all be valid
+ self.validate_offsets_full::<i8>(self.child_data.len())?;
Review comment:
I don't think this is correct. Despite what the method name suggests,
`validate_offsets_full` is designed for validating list offsets and therefore
also checks for monotonicity, which type ids are not required to satisfy.
As an aside, the spec doesn't seem very clear about whether an offset can be
repeated...
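
To illustrate the distinction, here is a minimal sketch of a range-only check for type ids. The helper name `validate_union_type_ids` is hypothetical and not part of arrow-rs, and it assumes that type ids index directly into the child arrays, so that valid values are `0..num_children`:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that every union
/// type id refers to an existing child array.
///
/// Assumption: type ids index directly into the child arrays, so valid
/// values are `0..num_children`. No ordering between entries is required.
fn validate_union_type_ids(type_ids: &[i8], num_children: usize) -> Result<(), String> {
    for (i, &id) in type_ids.iter().enumerate() {
        // Negative ids are never valid; non-negative ids must name a child.
        if id < 0 || id as usize >= num_children {
            return Err(format!(
                "type id {} at position {} is outside the valid range 0..{}",
                id, i, num_children
            ));
        }
    }
    Ok(())
}
```

A sequence such as `[1, 0, 1, 0]` passes this check with two children, whereas a check that also requires monotonicity would reject it.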
##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
let child = &self.child_data[0];
self.validate_offsets_full::<i64>(child.len + child.offset)?;
}
- DataType::Union(_, _) => {
- // Validate Union Array as part of implementing new Union semantics
- // See comments in `ArrayData::validate()`
- // https://github.com/apache/arrow-rs/issues/85
- //
- // TODO file follow on ticket for full union validation
+ DataType::Union(_fields, mode) => {
Review comment:
Unless I'm missing something, we should probably also add buffer length
checks to `ArrayData::validate`, as I don't think these are currently present
anywhere.
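
As a rough sketch of what such a check might look like, the function below compares buffer sizes in bytes against the number of slots the array addresses. The name `validate_union_buffer_lengths` and its parameters are hypothetical; it assumes one `i8` type id per slot and, for dense unions only, one `i32` offset per slot:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that the union's
/// buffers are large enough for `len + offset` slots.
///
/// Assumptions: the type ids buffer holds one `i8` per slot, and a dense
/// union additionally has an offsets buffer holding one `i32` per slot.
/// Pass `None` for `offsets_bytes` to model a sparse union.
fn validate_union_buffer_lengths(
    type_ids_bytes: usize,
    offsets_bytes: Option<usize>,
    len: usize,
    offset: usize,
) -> Result<(), String> {
    // Number of slots the array can address, guarding against overflow.
    let slots = len
        .checked_add(offset)
        .ok_or_else(|| "len + offset overflows usize".to_string())?;

    // Type ids are one byte each, so the required size equals `slots`.
    if type_ids_bytes < slots {
        return Err(format!(
            "type ids buffer has {} bytes, need at least {}",
            type_ids_bytes, slots
        ));
    }

    if let Some(actual) = offsets_bytes {
        let required = slots
            .checked_mul(std::mem::size_of::<i32>())
            .ok_or_else(|| "offsets buffer size overflows usize".to_string())?;
        if actual < required {
            return Err(format!(
                "offsets buffer has {} bytes, need at least {}",
                actual, required
            ));
        }
    }
    Ok(())
}
```

Running a check of this kind before the full validation would let later code index into the buffers without re-checking their sizes.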
##########
File path: arrow/src/array/data.rs
##########
@@ -1117,6 +1119,44 @@ impl ArrayData {
)
}
+ /// Ensures that for each union element, the offset is correct for
+ /// the corresponding child array
+ fn validate_dense_union_full(&self) -> Result<()> {
Review comment:
I think this should also check that offsets are monotonic for a given child
type, but that could definitely be left as a TODO.
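
A minimal sketch of that per-child check follows, tracking the last offset seen for each type id. The helper `validate_dense_offsets_monotonic` is hypothetical and not part of arrow-rs. It assumes type ids index directly into the children, and it accepts repeated offsets (non-decreasing rather than strictly increasing), which is itself an assumption:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that, for each child
/// of a dense union, the offsets that select that child never decrease.
///
/// Assumptions: type ids index directly into the children (`0..num_children`)
/// and a repeated offset for the same child is tolerated.
fn validate_dense_offsets_monotonic(
    type_ids: &[i8],
    offsets: &[i32],
    num_children: usize,
) -> Result<(), String> {
    if type_ids.len() != offsets.len() {
        return Err(format!(
            "got {} type ids but {} offsets",
            type_ids.len(),
            offsets.len()
        ));
    }

    // Last offset observed so far for each child, if any.
    let mut last_seen: Vec<Option<i32>> = vec![None; num_children];

    for (i, (&id, &off)) in type_ids.iter().zip(offsets.iter()).enumerate() {
        if id < 0 || id as usize >= num_children {
            return Err(format!("invalid type id {} at position {}", id, i));
        }
        if off < 0 {
            return Err(format!("negative offset {} at position {}", off, i));
        }
        let child = id as usize;
        if let Some(prev) = last_seen[child] {
            if off < prev {
                return Err(format!(
                    "offset {} at position {} goes backwards for child {} (previous {})",
                    off, i, child, prev
                ));
            }
        }
        last_seen[child] = Some(off);
    }
    Ok(())
}
```

Because each child is tracked separately, interleaved type ids such as `[0, 1, 0, 1]` with offsets `[0, 0, 1, 1]` pass, while a list-style check over the whole offsets buffer would not be the right fit.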
##########
File path: arrow/src/array/data.rs
##########
@@ -1117,6 +1119,44 @@ impl ArrayData {
)
}
+ /// Ensures that for each union element, the offset is correct for
+ /// the corresponding child array
+ fn validate_dense_union_full(&self) -> Result<()> {
+ // safety justification is that the size of the buffers was validated in self.validate()
Review comment:
We could potentially make `validate` also check that all child arrays
have the length of the parent in the case of a sparse representation.
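
For reference, in the Arrow columnar format it is the sparse layout in which every child has the same length as the union; dense children may be shorter. A minimal sketch of a child-length check under that reading is below. The helper `validate_sparse_child_lengths` is hypothetical and, as a simplifying assumption, ignores any array offset:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that every child of
/// a sparse union has the same length as the union itself.
///
/// Assumption: array offsets are ignored, so lengths are compared directly.
fn validate_sparse_child_lengths(parent_len: usize, child_lens: &[usize]) -> Result<(), String> {
    for (i, &child_len) in child_lens.iter().enumerate() {
        if child_len != parent_len {
            return Err(format!(
                "child {} has length {}, expected {} to match the union",
                i, child_len, parent_len
            ));
        }
    }
    Ok(())
}
```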
##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
let child = &self.child_data[0];
self.validate_offsets_full::<i64>(child.len + child.offset)?;
}
- DataType::Union(_, _) => {
- // Validate Union Array as part of implementing new Union semantics
- // See comments in `ArrayData::validate()`
- // https://github.com/apache/arrow-rs/issues/85
- //
- // TODO file follow on ticket for full union validation
+ DataType::Union(_fields, mode) => {
+ match mode {
+ UnionMode::Sparse => {
+ // typeids should all be valid
+ self.validate_offsets_full::<i8>(self.child_data.len())?;
+ }
+ UnionMode::Dense => {
+ self.validate_dense_union_full()?;
Review comment:
I was going to suggest that this should validate that the null bitmasks
are disjoint, but this may not even be a requirement. The specification says:
"All “unselected” values are ignored and could be any semantically correct
array value."