Yordan Pavlov created ARROW-11799:
-------------------------------------
Summary: [Rust] String and Binary arrays created with incorrect
length from unbound iterator
Key: ARROW-11799
URL: https://issues.apache.org/jira/browse/ARROW-11799
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Affects Versions: 3.0.0
Reporter: Yordan Pavlov
Assignee: Yordan Pavlov
While looking for a way to make loading array data from parquet files faster, I
stumbled on an edge case where string and binary arrays are created with an
incorrect length from an iterator with no upper bound.
Here is a simple example:
```
// the Array trait is needed for len()
use arrow::array::{Array, StringArray};

// iterator that doesn't declare an (upper) size bound
let string_iter = (0..).scan(0usize, |pos, i| {
    if *pos < 10 {
        *pos += 1;
        Some(Some(format!("value {}", i)))
    } else {
        // actually returns up to 10 values
        None
    }
})
// limited using take()
.take(100);

let (lower_size_bound, upper_size_bound) = string_iter.size_hint();
assert_eq!(lower_size_bound, 0);
// the upper bound, defined by take above, is 100
assert_eq!(upper_size_bound, Some(100));

let string_array: StringArray = string_iter.collect();
// but the actual number of items in the array is 10
assert_eq!(string_array.len(), 10);
```
Fortunately, this is easy to fix by using the length of the child offset array,
and I will create a PR for this shortly.
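To illustrate the idea behind the fix, here is a minimal standard-library-only sketch. It is not the Arrow implementation: `SketchStringArray`, `from_iter_sketch` and `build_example` are hypothetical names made up for this illustration. The sketch assumes an offsets-plus-values layout and shows that the length can be taken from the number of offsets actually written, with `size_hint()` used only to pre-allocate.
```rust
/// Hypothetical stand-in for a variable-length string array:
/// one flat byte buffer plus an offsets vector with len + 1 entries.
struct SketchStringArray {
    offsets: Vec<i32>,
    values: Vec<u8>,
    validity: Vec<bool>,
}

impl SketchStringArray {
    fn from_iter_sketch<I>(iter: I) -> Self
    where
        I: IntoIterator<Item = Option<String>>,
    {
        let iter = iter.into_iter();
        // size_hint() is only a capacity hint here, never the array length
        let (lower, _upper) = iter.size_hint();
        let mut offsets = Vec::with_capacity(lower + 1);
        let mut values = Vec::new();
        let mut validity = Vec::with_capacity(lower);
        offsets.push(0i32);
        for item in iter {
            match item {
                Some(s) => {
                    values.extend_from_slice(s.as_bytes());
                    validity.push(true);
                }
                None => validity.push(false),
            }
            // one offset is written per item actually yielded
            offsets.push(values.len() as i32);
        }
        Self { offsets, values, validity }
    }

    /// Length derived from the offsets, not from the iterator's size_hint().
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns the string at index i, or None if the slot is null.
    fn value(&self, i: usize) -> Option<&str> {
        if !self.validity[i] {
            return None;
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).ok()
    }
}

/// Builds the sketch array from the same iterator as in the example above.
fn build_example() -> SketchStringArray {
    let string_iter = (0..)
        .scan(0usize, |pos, i| {
            if *pos < 10 {
                *pos += 1;
                Some(Some(format!("value {}", i)))
            } else {
                None
            }
        })
        .take(100);
    SketchStringArray::from_iter_sketch(string_iter)
}

fn main() {
    let array = build_example();
    // size_hint() reported an upper bound of 100, but only 10 items exist
    println!("len = {}", array.len());
}
```
With this approach the reported length always matches the number of items the iterator yielded, regardless of what `size_hint()` claims.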
--
This message was sent by Atlassian Jira
(v8.3.4#803005)