[GitHub] [arrow-datafusion] rdettai opened a new issue #997: Improve statistics (umbrella issue)

GitBox Mon, 13 Sep 2021 01:56:23 -0700


rdettai opened a new issue #997:
URL: https://github.com/apache/arrow-datafusion/issues/997



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   This is an umbrella issue to gather all improvements regarding statistics.
   
   **Describe the solution you'd like**
   - [ ] #962
   - [ ] #992 
   - [ ] remove `total_byte_size` as we are not using it
   - [ ] replace the `is_exact` field at the `Statistics` level with per-field 
information
   - [ ] have more granularity in statistics than just `(value, is_exact)`: 
possible solutions are histograms (cf. [Spark 
CBOs](https://issues.apache.org/jira/browse/SPARK-16026))
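
To make the per-field exactness item more concrete, here is a minimal sketch of 
one possible shape. The names `Estimate` and `TableStats` are hypothetical and 
are not DataFusion's actual definitions; the only assumption is that each 
statistic carries its own exactness instead of sharing a single `is_exact` flag.

```rust
/// Hypothetical wrapper: a statistic value that carries its own exactness.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Estimate<T> {
    /// The value is known to be exact.
    Exact(T),
    /// The value is only an estimate.
    Inexact(T),
    /// Nothing is known about the value.
    Unknown,
}

impl<T: Copy> Estimate<T> {
    /// Returns the value if there is one, whether exact or not.
    pub fn value(&self) -> Option<T> {
        match self {
            Estimate::Exact(v) | Estimate::Inexact(v) => Some(*v),
            Estimate::Unknown => None,
        }
    }

    /// True only for values known to be exact.
    pub fn is_exact(&self) -> bool {
        matches!(self, Estimate::Exact(_))
    }

    /// Demotes an exact value to an estimate; other variants are unchanged.
    pub fn to_inexact(self) -> Self {
        match self {
            Estimate::Exact(v) => Estimate::Inexact(v),
            other => other,
        }
    }
}

/// Hypothetical statistics struct (assumed layout): every field has its own
/// exactness, so no struct-level `is_exact` flag is needed.
#[derive(Debug, Clone, PartialEq)]
pub struct TableStats {
    pub num_rows: Estimate<usize>,
    pub column_null_counts: Vec<Estimate<usize>>,
}
```

With this shape a source could report an exact row count next to estimated 
null counts, which a single struct-level flag cannot express.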
   
   **Additional context**
   Statistics are usually sourced at the datasource level, then propagated 
through the plan tree according to the types of nodes. They are used to choose 
between different logically equivalent plans or plan configurations. The more 
rules are implemented for propagating the statistics, the more information the 
optimizer will have to make good decisions. But at the same time, an overly 
complex abstraction that is not used by any optimization rule would bloat the 
code base and make it harder to maintain. For that reason, extensions of the 
statistics system should be driven by the addition of concrete optimization 
rules that require them.
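
As an illustration of the propagation described above, the following sketch 
derives a row count bottom-up through a toy plan tree and uses it in one 
optimizer-style decision. Everything here (`RowCount`, `Plan`, `row_count`, 
`build_on_left`) is hypothetical and much simpler than DataFusion's real plan 
and statistics types; the per-node rules are assumptions chosen for the 
example.

```rust
/// Hypothetical row-count statistic in the `(value, is_exact)` style.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowCount {
    pub value: Option<usize>,
    pub is_exact: bool,
}

/// Toy plan tree (assumed for illustration only).
pub enum Plan {
    /// Statistics are sourced at the datasource level.
    Scan { rows: RowCount },
    Filter { input: Box<Plan> },
    Limit { input: Box<Plan>, n: usize },
    Union { left: Box<Plan>, right: Box<Plan> },
}

/// Propagates the row count through the tree according to the node type.
pub fn row_count(plan: &Plan) -> RowCount {
    match plan {
        Plan::Scan { rows } => *rows,
        // A filter can only drop rows. Without selectivity information the
        // input count is kept, but it is no longer exact.
        Plan::Filter { input } => RowCount {
            value: row_count(input).value,
            is_exact: false,
        },
        // A limit caps the count at `n`. The result stays exact only when
        // the input count was exact.
        Plan::Limit { input, n } => {
            let r = row_count(input);
            match r.value {
                Some(v) => RowCount { value: Some(v.min(*n)), is_exact: r.is_exact },
                None => RowCount { value: Some(*n), is_exact: false },
            }
        }
        // A union adds both sides and is exact only if both sides are.
        Plan::Union { left, right } => {
            let (l, r) = (row_count(left), row_count(right));
            match (l.value, r.value) {
                (Some(a), Some(b)) => RowCount {
                    value: Some(a + b),
                    is_exact: l.is_exact && r.is_exact,
                },
                _ => RowCount { value: None, is_exact: false },
            }
        }
    }
}

/// Example consumer: prefer the smaller input as the hash join build side.
/// Returns `None` when either count is unknown, so no decision is forced.
pub fn build_on_left(left: &Plan, right: &Plan) -> Option<bool> {
    match (row_count(left).value, row_count(right).value) {
        (Some(l), Some(r)) => Some(l <= r),
        _ => None,
    }
}
```

Each match arm is one propagation rule, and `build_on_left` is the kind of 
concrete optimization rule that would justify adding it.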
   


