alamb commented on code in PR #9848:
URL: https://github.com/apache/arrow-rs/pull/9848#discussion_r3290935306
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -230,12 +234,17 @@ impl<'a> ArrayReaderBuilder<'a> {
&self,
field: &ParquetField,
Review Comment:
Since we have to change all the callsites anyways to plumb a new argument
down, what do you think about making a struct to make it easier in the future
(and give us a clear place to document what `padding_threshold` means?
It might also make it easier to find the callsite that padding_threshold was
added
Something like
```rust
struct ReaderArgs<'a> {
/// What field the output corresponds to
field: &ParquetField,
/// Which columns are being read
mask: &ProjectionMask,
/// To be documented here
padding_threshold: Option<i16>,
}
```
If you like this idea, we could try and do it as a preparatory / mechanical
PR and then this one would be a lot smaller
Then instead of
```rust
self.build_list_reader(field, mask, padding_threshold)
```
It woudl end up like
```rust
self.build_list_reader(reader_args)
```
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -113,9 +113,7 @@ impl<'a> ArrayReaderBuilder<'a> {
}
}
- /// Set the batch size used to pre-allocate internal buffers.
- ///
- /// This avoids reallocations when reading the first batch of data.
Review Comment:
any reason to remove this comment?
##########
parquet/src/arrow/array_reader/byte_view_array.rs:
##########
@@ -36,14 +36,12 @@ use bytes::Bytes;
use std::any::Any;
/// Returns an [`ArrayReader`] that decodes the provided byte array column to
view types.
-///
-/// `batch_size` is used to pre-allocate internal buffers,
Review Comment:
why remove the comments?
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -230,12 +234,17 @@ impl<'a> ArrayReaderBuilder<'a> {
&self,
field: &ParquetField,
Review Comment:
maybe also it needs `batch_size`
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -230,12 +234,17 @@ impl<'a> ArrayReaderBuilder<'a> {
&self,
field: &ParquetField,
mask: &ProjectionMask,
+ padding_threshold: Option<i16>,
) -> Result<Option<Box<dyn ArrayReader>>> {
let children = field.children().unwrap();
assert_eq!(children.len(), 2);
- let key_reader = self.build_reader(&children[0], mask)?;
- let value_reader = self.build_reader(&children[1], mask)?;
+ // The map is internally List(Struct(key, value)). The key/value
+ // readers are children of the struct inside the list, so their
+ // padding threshold is the map's (= list's) def_level.
+ let child_threshold = Some(field.def_level);
Review Comment:
With a `ReaderArgs` structure I can imagine this looks like
```rust
fn build_map_reader(args: ReaderArgs<'_>) -> Result<Option<Box<dyn
ArrayReader>>> {
...
// The map is internally List(Struct(key, value)). The key/value
// readers are children of the struct inside the list, so their
// padding threshold is the map's (= list's) def_level.
let args = args.with_padding_threshold(Some(field.def_level));
let key_reader = self.build_reader(args.with_field(&children[0]));
let value_reader = self.build_reader(args.with_field(&children[1]));
```
Though maybe this is too magical 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]