alamb opened a new pull request, #13293:
URL: https://github.com/apache/datafusion/pull/13293

   
   ## Which issue does this PR close?
   
   - Closes https://github.com/apache/datafusion/issues/8078
   - Related to https://github.com/apache/datafusion/issues/10316
   
   # Discussion:
   This is a Request for Comment (and maybe also a POC)
   
   I am going to hack out another PR that takes a slightly different approach ( 
`Precision::Interval`) and see how much better/worse that is. 
   
   ## Rationale for this change
   
   For the analysis described on 
https://github.com/apache/datafusion/issues/10316, we need to know if a value 
is constrained to a range to avoid merging. However the current `Statistics` 
are either `Exact` or `Inexact`, so once the precision becomes Inexact we lose 
the information that the possible minimum / maximum values do not change.
   
   This came up twice for me recently:
   1. @suremarc, @findepi and I spoke about this here: 
https://github.com/apache/datafusion/issues/8227#issuecomment-2458134570 where 
we need to know if a value is constrained to a range to avoid merging (see 
https://github.com/apache/datafusion/issues/10316)
   2. Yesterday I happened to be working on code in InfluxDB 3.0 that relies on 
knowing min/max values and I hit 
https://github.com/influxdata/arrow-datafusion/commit/30d4368127db80c125a1adea22a955a496a516df
 which marks the statistics as "inexact" when filter pushdown is on for 
parquet, but that loses the key information that the possible minimum / maximum 
values do not change
   
   I hacked around it downstream, but I think this is all sounding like it is 
time to add a new `Precision` enum that allows for this usecase
   
   
   ## What changes are included in this PR?
   - [x] Introduce `Precision::AtMost`, `Precision::AtLeast`
   - [x] Add `ColumnStatistics::min` and `ColumnStatistics::max` that return 
the correct value
   - [ ] Update code to handle new precision
   - [ ] Update ParquetExec to use the new Precision variants when filters are 
on
   
   ## Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are 
they covered by existing tests)?
   -->
   
   ## Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to