alamb commented on code in PR #9497:
URL: https://github.com/apache/arrow-rs/pull/9497#discussion_r3017751948
##########
arrow-json/src/reader/mod.rs:
##########
@@ -819,7 +819,7 @@ mod tests {
use arrow_array::cast::AsArray;
use arrow_array::{
Array, BooleanArray, Float64Array, GenericListViewArray, ListArray,
OffsetSizeTrait,
- StringArray, StringViewArray,
+ StringArray, StringViewArray, StructArray, make_array,
};
use arrow_buffer::{ArrowNativeType, Buffer};
use arrow_cast::display::{ArrayFormatter, FormatOptions};
Review Comment:
This PR has somehow broken decoding of nested REE arrays. It appears to be
related to the changes in `arrow-json/src/reader/run_end_array.rs`:
Here is a test case (I made a PR to add it):
https://github.com/apache/arrow-rs/pull/9634
```rust
#[test]
fn test_read_nested_run_end_encoded() {
let buf = r#"
{"a": "x"}
{"a": "x"}
{"a": "y"}
"#;
// The outer REE compresses whole rows, while the inner REE
compresses the
// repeated string values produced by decoding those rows.
let inner_type = DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int64, false)),
Arc::new(Field::new("values", DataType::Utf8, true)),
);
let outer_type = DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int64, false)),
Arc::new(Field::new("values", inner_type, true)),
);
let schema = Arc::new(Schema::new(vec![Field::new("a", outer_type,
true)]));
let batches = do_read(buf, 1024, false, false, schema);
assert_eq!(batches.len(), 1);
let col = batches[0].column(0);
let outer = col.as_run::<arrow_array::types::Int64Type>();
// Three logical rows compress to two outer runs: ["x", "x"] and
["y"].
assert_eq!(outer.len(), 3);
assert_eq!(outer.run_ends().values(), &[2, 3]);
let nested =
outer.values().as_run::<arrow_array::types::Int64Type>();
// The physical values of the outer REE are themselves a two-element
REE.
assert_eq!(nested.len(), 2);
assert_eq!(nested.run_ends().values(), &[1, 2]);
let nested_values = nested.values().as_string::<i32>();
assert_eq!(nested_values.len(), 2);
assert_eq!(nested_values.value(0), "x");
assert_eq!(nested_values.value(1), "y");
}
```
On main:
```
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test -p
arrow-json reader::tests::test_read_nested_run_end_encoded -- --nocapture
Blocking waiting for file lock on build directory
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.35s
Running unittests src/lib.rs
(target/debug/deps/arrow_json-713c99bb8504d9f8)
running 1 test
test reader::tests::test_read_nested_run_end_encoded ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 120 filtered
out; finished in 0.00s
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs
```
On this branch it panics
```rust
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test -p
arrow-json reader::tests::test_read_nested_run_end_encoded -- --nocapture
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.07s
Running unittests src/lib.rs
(target/debug/deps/arrow_json-dd7fe742fe543fef)
running 1 test
thread 'reader::tests::test_read_nested_run_end_encoded' (11584243) panicked
at arrow-ord/src/cmp.rs:259:18:
internal error: entered unreachable code
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test reader::tests::test_read_nested_run_end_encoded ... FAILED
failures:
failures:
reader::tests::test_read_nested_run_end_encoded
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 120 filtered
out; finished in 0.00s
error: test failed, to rerun pass `-p arrow-json --lib`
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
```
##########
arrow-json/src/reader/list_array.rs:
##########
@@ -15,20 +15,23 @@
// specific language governing permissions and limitations
// under the License.
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+use arrow_array::builder::BooleanBufferBuilder;
+use arrow_array::{ArrayRef, GenericListArray, GenericListViewArray,
OffsetSizeTrait};
+use arrow_buffer::buffer::NullBuffer;
+use arrow_buffer::{OffsetBuffer, ScalarBuffer};
+use arrow_schema::{ArrowError, DataType, FieldRef};
+
use crate::reader::tape::{Tape, TapeElement};
use crate::reader::{ArrayDecoder, DecoderContext};
-use arrow_array::OffsetSizeTrait;
-use arrow_array::builder::BooleanBufferBuilder;
-use arrow_buffer::{Buffer, buffer::NullBuffer};
-use arrow_data::{ArrayData, ArrayDataBuilder};
-use arrow_schema::{ArrowError, DataType};
-use std::marker::PhantomData;
pub type ListArrayDecoder<O> = ListLikeArrayDecoder<O, false>;
pub type ListViewArrayDecoder<O> = ListLikeArrayDecoder<O, true>;
pub struct ListLikeArrayDecoder<O, const IS_VIEW: bool> {
- data_type: DataType,
+ field: FieldRef,
Review Comment:
this is a nice change (as this now clones an Arc rather than a DataType) 👍
##########
arrow-json/Cargo.toml:
##########
@@ -39,8 +39,9 @@ all-features = true
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
-arrow-data = { workspace = true }
+arrow-ord = { workspace = true }
Review Comment:
Can you please pull these changes (I think they are related to rEE
optimizations) into a separate PR for review separately?
##########
arrow-json/src/reader/list_array.rs:
##########
@@ -91,34 +94,34 @@ impl<O: OffsetSizeTrait, const IS_VIEW: bool> ArrayDecoder
for ListLikeArrayDeco
}
let offset = O::from_usize(child_pos.len()).ok_or_else(|| {
- ArrowError::JsonError(format!("offset overflow decoding {}",
self.data_type))
+ ArrowError::JsonError(format!("offset overflow decoding
{}ListArray", O::PREFIX))
})?;
offsets.push(offset);
}
- let child_data = self.decoder.decode(tape, &child_pos)?;
+ let values = self.decoder.decode(tape, &child_pos)?;
let nulls = nulls.as_mut().map(|x| NullBuffer::new(x.finish()));
- let mut data = ArrayDataBuilder::new(self.data_type.clone())
- .len(pos.len())
- .nulls(nulls)
- .child_data(vec![child_data]);
-
if IS_VIEW {
let mut sizes = Vec::with_capacity(offsets.len() - 1);
for i in 1..offsets.len() {
sizes.push(offsets[i] - offsets[i - 1]);
}
offsets.pop();
- data = data
- .add_buffer(Buffer::from_vec(offsets))
- .add_buffer(Buffer::from_vec(sizes));
+ let array = GenericListViewArray::<O>::try_new(
Review Comment:
This now does full validation for LIstViewArray I think -- which is more
expensive than the previous creation via data.build_unchecked()
I can see two alternatives:
1. Add a `unsafe GenericListViewArray::new_unchecked` methods
2. Leave this code as is and create an ArrayRef directly from the ArrayData
I would selfishly suggest 2 (and then a follow on PR to add `unsafe
GenericListViewArray::new_unchecked` / rework this code) to make reviewing this
one eaiser on my
##########
arrow-json/src/reader/run_end_array.rs:
##########
@@ -56,64 +60,39 @@ impl<R: RunEndIndexType> RunEndEncodedArrayDecoder<R> {
}
impl<R: RunEndIndexType + Send> ArrayDecoder for RunEndEncodedArrayDecoder<R> {
- fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData,
ArrowError> {
+ fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayRef,
ArrowError> {
let len = pos.len();
if len == 0 {
- return Ok(new_empty_array(&self.data_type).to_data());
+ return Ok(new_empty_array(&self.data_type));
}
- let flat_data = self.decoder.decode(tape, pos)?;
+ let flat_array = self.decoder.decode(tape, pos)?;
Review Comment:
as pointed out, there is something wrong with this code -- I recommend
reverting the extra optimization from this PR and adding it as a follow on PR
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]