thomas-k-cameron opened a new issue, #2976:
URL: https://github.com/apache/arrow-datafusion/issues/2976
**Describe the bug**
Some fields of `Statistics` could return `Some(0)` when the value is not
available.
TODO comment says that some fields are supposed to be `None` instead of
`Some(0)` under some circumstances but the value is set to `Some(0)` even when
it is supposed to have `None`.
```rust
/// Get all files as well as the file level summary statistics (no statistic
for partition columns).
/// If the optional `limit` is provided, includes only sufficient files.
/// Needed to read up to `limit` number of rows.
/// TODO fix case where `num_rows` and `total_byte_size` are not defined
(stat should be None instead of Some(0))
pub async fn get_statistics_with_limit(
all_files: impl Stream<Item = Result<(PartitionedFile, Statistics)>>,
file_schema: SchemaRef,
limit: Option<usize>,
) -> Result<(Vec<PartitionedFile>, Statistics)>
```
**To Reproduce**
I haven't ran into any breaking bugs yet, however, the functions is used in
a method provided by one of the method implemented on `ListingTable` struct and
`Some(0)` is returned when the `options.collect_stat` is set to `false` (aka
when it is not supposed to be available).
And it appears that it is generating
`Statistics` is passed to a method implemented on `Arc<dyn FileFormat>` so
chance of someone running into a bug is not zero.
Also, `pub async fn get_statistics_with_limit` is a exported function.
**Expected behavior**
returns `None` instead of `Some(0)`.
**Alternatives Considered**
1. Turn `Some(0)` to `None`.
However, this solution would be rather labour intensive as the field is
being used by everywhere (over 200 occurrence)
Also anyone using it cannot determine is the size is 0 or the value is just
not available
2. Assign `None` in ListingTable::async fn list_files_for_scan<'a>
`ListingTable::async fn list_files_for_scan<'a>` is the only function
calling `get_statistics_with_limit`. It can be fixed by assigning `None` to
statistics.
However, `get_statistics_with_limit` is being exported so I didn't think it
really solves the problem.
3. Return a number that was available
The `Statistics` struct is basically an aggregate of all `Statistics` struct
passed onto the function.
Some values may have fields that is not `None`.
Considering that `Statistics` is not meant to always return a accurate
value, I think this could work too.
**Additional context**
Is it suppose to run get_statistics_with_limit when `options.collect_stat`
is set to `false`?
Thank you very much for reading my issue and please let me know if I'm
getting it wrong.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]