jecsand838 commented on code in PR #8254:
URL: https://github.com/apache/arrow-rs/pull/8254#discussion_r2312110172
##########
arrow-avro/src/reader/record.rs:
##########
@@ -590,10 +590,23 @@ impl Decoder {
)));
}
}
+ // Extract the value field nullability from the schema
+ let is_value_nullable = match map_field.data_type() {
+ DataType::Struct(fields) => fields
+ .iter()
+ .find(|f| f.name() == "value")
+ .map(|f| f.is_nullable())
+ .unwrap_or(false),
+ _ => true, // default to nullable
+ };
let entries_struct = StructArray::new(
Fields::from(vec![
Arc::new(ArrowField::new("key", DataType::Utf8, false)),
- Arc::new(ArrowField::new("value", val_arr.data_type().clone(), true)),
+ Arc::new(ArrowField::new(
+ "value",
+ val_arr.data_type().clone(),
+ is_value_nullable,
Review Comment:
I took a slightly different approach in #8220: it avoids the field scan and the
new `Field` allocation, and it preserves the existing schema metadata.
It also benchmarks slightly faster, especially on smaller batches:
```
Map/100 time: [6.2237 µs 6.2594 µs 6.2899 µs]
thrpt: [507.32 MiB/s 509.79 MiB/s 512.71 MiB/s]
change:
time: [−0.3038% +0.5799% +1.4930%] (p = 0.19 > 0.05)
thrpt: [−1.4710% −0.5766% +0.3047%]
No change in performance detected.
Map/10000 time: [250.40 µs 253.75 µs 258.62 µs]
thrpt: [1.2573 GiB/s 1.2814 GiB/s 1.2986 GiB/s]
change:
time: [−2.5344% −1.1670% +0.1631%] (p = 0.08 > 0.05)
thrpt: [−0.1628% +1.1808% +2.6003%]
No change in performance detected.
Found 6 outliers among 25 measurements (24.00%)
6 (24.00%) low mild
Map/1000000 time: [252.99 µs 255.93 µs 260.24 µs]
thrpt: [130.21 GiB/s 132.40 GiB/s 133.94 GiB/s]
change:
time: [−1.9418% −0.4373% +1.0680%] (p = 0.60 > 0.05)
thrpt: [−1.0568% +0.4393% +1.9803%]
No change in performance detected.
```
vs
```
Map/100 time: [6.4487 µs 6.4584 µs 6.4687 µs]
thrpt: [493.30 MiB/s 494.09 MiB/s 494.83 MiB/s]
change:
time: [+4.2911% +5.0113% +5.7191%] (p = 0.00 < 0.05)
thrpt: [−5.4097% −4.7721% −4.1145%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
Map/10000 time: [258.90 µs 260.78 µs 263.17 µs]
thrpt: [1.2356 GiB/s 1.2469 GiB/s 1.2559 GiB/s]
change:
time: [−0.6514% +0.8318% +2.5189%] (p = 0.34 > 0.05)
thrpt: [−2.4570% −0.8249% +0.6557%]
No change in performance detected.
Found 1 outliers among 25 measurements (4.00%)
1 (4.00%) high severe
Map/1000000 time: [265.48 µs 268.25 µs 270.56 µs]
thrpt: [125.24 GiB/s 126.33 GiB/s 127.64 GiB/s]
change:
time: [+3.0081% +4.1202% +5.2124%] (p = 0.00 < 0.05)
thrpt: [−4.9542% −3.9572% −2.9203%]
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
```
If you want to check it out:
https://github.com/apache/arrow-rs/blob/ebf402915511308201ef5fbb92368d696ee50ff5/arrow-avro/src/reader/record.rs#L685
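
To illustrate the trade-off described above, here is a minimal std-only sketch. It is not the code from #8220 or from this PR: `Field`, `FieldRef`, `rebuild_value_field`, and `reuse_value_field` are hypothetical stand-ins for the arrow-rs types, and the sketch assumes the map entries are laid out as `[key, value]`. It only shows why reusing a shared field reference can skip both the name scan and the allocation while keeping metadata intact.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for an Arrow field (hypothetical, NOT the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Shared, reference-counted field handle (mirrors the idea of a field ref).
type FieldRef = Arc<Field>;

/// Scan-and-rebuild: look up "value" by name, then allocate a fresh field.
/// The new field carries only the name and nullability, so any metadata
/// attached to the schema's field is dropped.
fn rebuild_value_field(entries: &[FieldRef]) -> FieldRef {
    let nullable = entries
        .iter()
        .find(|f| f.name == "value")
        .map(|f| f.nullable)
        .unwrap_or(false);
    Arc::new(Field {
        name: "value".to_string(),
        nullable,
        metadata: HashMap::new(),
    })
}

/// Reuse: hand back the field the schema already holds.
/// Assumption: entries are ordered [key, value], so the value is at index 1.
/// `Arc::clone` only bumps a reference count; nothing is scanned or allocated,
/// and nullability and metadata come along unchanged.
fn reuse_value_field(entries: &[FieldRef]) -> Option<FieldRef> {
    entries.get(1).map(Arc::clone)
}
```

With an entries slice whose value field carries metadata, `reuse_value_field` returns a handle to the very same allocation, while `rebuild_value_field` returns a new field with empty metadata.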