rdettai commented on issue #1301:
URL:
https://github.com/apache/arrow-datafusion/issues/1301#issuecomment-970049109
You are correct, we are not managing per-file statistics anywhere, nor do we
have per-partition statistics. The only granularity at which we are managing
statistics **(for now)** is the `ExecutionPlan` node level:
```Rust
pub trait ExecutionPlan: Debug + Send + Sync {
    // ...

    /// Returns the global output statistics for this `ExecutionPlan` node.
    fn statistics(&self) -> Statistics;
}
```
This means that there is no point in providing higher granularity statistics
to the `ExecutionPlan`; it wouldn't know what to do with them 😄. The idea is
that anything you want to do with your higher granularity statistics
(pruning, ...), you do at the `TableProvider` level, and then you give the
`ExecutionPlan` only the information it needs, that is to say the file
statistics aggregated together (you can use `get_statistics_with_limit` to do
that).
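To make that concrete, here is a minimal, self-contained sketch of that
aggregation step. The `Statistics` struct below is a simplified stand-in for
DataFusion's real one, and `aggregate_statistics` is a hypothetical helper
mimicking what `get_statistics_with_limit` does before the result reaches the
`ExecutionPlan`:
```Rust
// Simplified stand-in for DataFusion's `Statistics` (hypothetical).
#[derive(Debug, Clone, Copy)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Fold per-file statistics into a single table-level summary.
/// If any file's value is unknown, the aggregated value is unknown too.
fn aggregate_statistics(per_file: &[Statistics]) -> Statistics {
    let init = Statistics {
        num_rows: Some(0),
        total_byte_size: Some(0),
    };
    per_file.iter().fold(init, |acc, s| Statistics {
        num_rows: acc.num_rows.zip(s.num_rows).map(|(a, b)| a + b),
        total_byte_size: acc
            .total_byte_size
            .zip(s.total_byte_size)
            .map(|(a, b)| a + b),
    })
}
```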
When (if) we decide that we need higher granularity statistics in the
`ExecutionPlan`, we will need to provide the statistics differently. But would
that be at the file level? I am not sure. It seems to me that the
"execution-plan-partition" level would be more appropriate. I guess that an
"execution-plan-partition" could be a file, multiple files, or a chunk of a
file. This means the statistics would be attached to another struct, which
could look like:
```Rust
struct PhysicalPlanFilePartition {
    file_group: Vec<PartitionedFile>,
    statistics: Statistics,
}
```
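Purely as a sketch (again with hypothetical, simplified types), such
per-partition statistics could still satisfy the existing node-level
`statistics()` contract by folding them back together:
```Rust
// Assumes the `PhysicalPlanFilePartition` struct above, together with the
// simplified `Statistics` and `aggregate_statistics` from the earlier sketch.

/// Hypothetical scan node whose partitions each carry their own statistics.
struct FileScanExec {
    partitions: Vec<PhysicalPlanFilePartition>,
}

impl FileScanExec {
    /// Node-level statistics derived by folding the per-partition ones,
    /// so the existing `ExecutionPlan::statistics()` contract is unchanged.
    fn statistics(&self) -> Statistics {
        let per_partition: Vec<Statistics> =
            self.partitions.iter().map(|p| p.statistics).collect();
        aggregate_statistics(&per_partition)
    }
}
```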
But again, I really think this abstraction is worth introducing only when (if)
we decide that we need higher granularity statistics in the `ExecutionPlan`.