tustvold commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834179677
##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1564,6 +1601,25 @@ mod tests {
one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));
}
+ #[test]
+ fn list_nested_nulls() {
Review comment:
This test was just something I wrote whilst exploring the problem space,
and I think is generally useful
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -887,7 +886,7 @@ fn remove_indices(
Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
}
}
- ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
+ ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() -
indices.len()))),
Review comment:
The original version of #1448 was this, which is correct, it was then
changed in a subsequent follow up
[commit](https://github.com/apache/arrow-rs/pull/1448/commits/acd8d47ddda45598b21b68cbd6c16849ae79e42e#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fL888).
I'm not sure why...
FYI @viirya
##########
File path: parquet/src/arrow/levels.rs
##########
@@ -677,8 +676,9 @@ impl LevelInfo {
len: usize,
) -> (Vec<i64>, Vec<bool>) {
match array.data_type() {
- DataType::Null
- | DataType::Boolean
+ // A NullArray is entirely nulls, despite not containing a null
buffer
+ DataType::Null => ((0..=(len as i64)).collect(), vec![false; len]),
Review comment:
This is the fix for #1480
##########
File path: parquet/src/arrow/levels.rs
##########
@@ -200,9 +200,8 @@ impl LevelInfo {
);
match child_array.data_type() {
- // TODO: The behaviour of a <list<null>> is untested
- DataType::Null => vec![list_level],
- DataType::Boolean
+ DataType::Null
Review comment:
This is the fix for #1036
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -225,11 +225,10 @@ where
/// Reads at most `batch_size` records into array.
fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
- let records_read =
- read_records(&mut self.record_reader, self.pages.as_mut(),
batch_size)?;
+ read_records(&mut self.record_reader, self.pages.as_mut(),
batch_size)?;
// convert to arrays
- let array = arrow::array::NullArray::new(records_read);
+ let array =
arrow::array::NullArray::new(self.record_reader.num_values());
Review comment:
This was a bug, the NullArray should have the value count number of
elements, i.e. the same number of elements as the definition/repetition levels
if any.
In the non-nested case the two quantities will be the same, but in the event
of a nested array a single record might correspond to 1 or more values.
This would then cause this line to fire -
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L930
##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1513,6 +1513,43 @@ mod tests {
required_and_optional::<LargeStringArray, _>(raw_strs);
}
+ #[test]
+ fn null_list_single_column() {
Review comment:
This is an adapted version of the test added by @novemberkilo in
https://github.com/apache/arrow-rs/pull/1477/files#diff-92fdd8b1e0a40e7072901fe74855da759d2d97f02c4f7224f4217468d545ecacR1276
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]