Re: [PR] Refactor Statistics [arrow-datafusion]

via GitHub Mon, 16 Oct 2023 03:38:06 -0700


crepererum commented on PR #7793:
URL: 
https://github.com/apache/arrow-datafusion/pull/7793#issuecomment-1764191447

I like the idea of "precision", however I have a few concerns regarding the
current implementation:

- **inexact:** What does "inexact" provide if you don't have ANY bounds
whatsoever? I think this only makes sense if you can specify if the "inexact"
guess is a at least also an upper/lower bound of the possible values. E.g. "row
counts" can be an inexact upper bound if you have a filter but you just pass
through the counts of the infiltered data. So I propose
- **constraints:** If you have min+max value, I think you rarely want a
totally random guess but they probably should be ends of a range and constraint
each other. Also see point above.

With that, I propose to change the `Precision` enum to:

- `Exact`: speaks for itself, might be rare for floats though
- `UpperBound`: The actual value is `<=` this given bound.
- `LowerBound`: The actual value is `>=` this given bound.
- `Absent`: speaks for itself

For min/max value this probably leads to `min=LowerBound(...)` and
`max=UpperBound(...)` in many cases which also nicely fits into the interval
analysis that also exists in the DF code base already.

A more drastic (but maybe better?) alternative is to keep `Inexact` but
instead of filling it with `T`, fill it with
[`Interval`](https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/struct.Interval.html).
This provides you bounds and we already have a bunch of code around that type.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Refactor Statistics [arrow-datafusion]

Reply via email to