rdettai commented on issue #1301:
URL:
https://github.com/apache/arrow-datafusion/issues/1301#issuecomment-970049109
You are correct, we are not managing per-file statistics anywhere, nor do we
have per-partition statistics. The only granularity at which we are managing
statistics **(for now)** is the `ExecutionPlan` node level:
```Rust
pub trait ExecutionPlan: Debug + Send + Sync {
    // ...

    /// Returns the global output statistics for this `ExecutionPlan` node.
    fn statistics(&self) -> Statistics;
}
```
This means that there is no point in providing higher granularity statistics
to the `ExecutionPlan`; it wouldn't know what to do with them 😄. The idea is
that anything you want to do with your higher granularity statistics
(pruning, ...), you do at the `TableProvider` level, and then you give the
`ExecutionPlan` only the information it needs, that is to say the file
statistics aggregated together (you can use `get_statistics_with_limit` to do
that).
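To make that concrete, here is a minimal, self-contained sketch of that
aggregation step. The `Statistics` struct below is a simplified stand-in for
DataFusion's real one, and `aggregate_statistics` is a hypothetical helper
mimicking what `get_statistics_with_limit` does before the result reaches the
`ExecutionPlan`:
```Rust
// Simplified stand-in for DataFusion's `Statistics` (hypothetical).
#[derive(Debug, Clone, Copy)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Fold per-file statistics into a single table-level summary.
/// If any file's value is unknown, the aggregated value is unknown too.
fn aggregate_statistics(per_file: &[Statistics]) -> Statistics {
    let init = Statistics {
        num_rows: Some(0),
        total_byte_size: Some(0),
    };
    per_file.iter().fold(init, |acc, s| Statistics {
        num_rows: acc.num_rows.zip(s.num_rows).map(|(a, b)| a + b),
        total_byte_size: acc
            .total_byte_size
            .zip(s.total_byte_size)
            .map(|(a, b)| a + b),
    })
}
```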
When (if) we decide that we need higher granularity statistics in the
`ExecutionPlan`, we will need to provide the statistics differently. But would
that be at the file level? I am not sure. It seems to me that the
"execution-plan-partition" level would be more appropriate. I guess that an
"execution-plan-partition" could be a file, multiple files, or a chunk of a
file. This means the statistics would be attached to another struct, which
could look like:
```Rust
struct PhysicalPlanFilePartition {
    file_group: Vec<PartitionedFile>,
    statistics: Statistics,
}
```
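Purely as a sketch (again with hypothetical, simplified types), such
per-partition statistics could still satisfy the existing node-level
`statistics()` contract by folding them back together:
```Rust
// Assumes the `PhysicalPlanFilePartition` struct above, together with the
// simplified `Statistics` and `aggregate_statistics` from the earlier sketch.

/// Hypothetical scan node whose partitions each carry their own statistics.
struct FileScanExec {
    partitions: Vec<PhysicalPlanFilePartition>,
}

impl FileScanExec {
    /// Node-level statistics derived by folding the per-partition ones,
    /// so the existing `ExecutionPlan::statistics()` contract is unchanged.
    fn statistics(&self) -> Statistics {
        let per_partition: Vec<Statistics> =
            self.partitions.iter().map(|p| p.statistics).collect();
        aggregate_statistics(&per_partition)
    }
}
```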
But again, I really think this abstraction is worth introducing only when (if)
we decide that we need higher granularity statistics in the `ExecutionPlan`.