etseidl commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2821942159
> I'm not immediately sure. Is the point that the result of `max(2.0, NaN)` depends on how you define the ordering of floating point numbers wrt NaN, which has two variations?

Yes. Different systems treat `NaN` differently, but IIUC datafusion uses total order for floating point comparison, so (loosely) `-NaN < -Inf < -x < -0 < 0 < x < Inf < NaN`. This ordering is being proposed for Parquet as well.

> If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct.

Yes, and I believe that's what parquet-java might already do. But many writers do write stats in this case, which leads to the usual backwards compatibility issues. So in my mind the ultimate solution is to check `ColumnOrder` for the predicate column. If it uses the new `IEEE_754_TOTAL_ORDER` ordering, then proceed as usual. If it uses `TYPE_DEFINED_ORDER`, then we know `NaN` may be present but not accounted for, so don't do any pruning.

The problem in my head is how to get the `ColumnOrder` info down into the plan generation code... or do we do extra plan rewriting at the parquet layer? I'm too new with datafusion to have a feel for where the correct place to handle this is.

> How do we handle this with nulls 🤔

Different can of worms 😄. I'm not sure what parquet-rs does with a column of `NaN` mixed with `null`. Guess I'll go see...
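To make the ordering point concrete, here is a minimal Rust sketch using only the standard library. `f64::total_cmp`, `f64::max`, and `f64::copysign` are real std methods; `ColumnOrderSketch`, `total_order_max`, and `stats_usable_for_pruning` are hypothetical names invented for illustration and are not the parquet-rs or datafusion API. The pruning rule only mirrors the idea described in the comment, not an existing implementation.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the Parquet column order discussed above.
/// This is NOT the parquet-rs type, just an illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrderSketch {
    TypeDefinedOrder,
    Ieee754TotalOrder,
}

/// Max of two floats under IEEE 754 total order, via `f64::total_cmp`.
fn total_order_max(a: f64, b: f64) -> f64 {
    match a.total_cmp(&b) {
        Ordering::Less => b,
        _ => a,
    }
}

/// Assumed rule from the comment: float min/max stats are trusted for
/// pruning only when the column declares total ordering.
fn stats_usable_for_pruning(order: ColumnOrderSketch) -> bool {
    match order {
        ColumnOrderSketch::Ieee754TotalOrder => true,
        ColumnOrderSketch::TypeDefinedOrder => false,
    }
}

fn main() {
    // Build NaNs with an explicit sign, since the sign of f64::NAN is not guaranteed.
    let pos_nan = f64::NAN.copysign(1.0);
    let neg_nan = f64::NAN.copysign(-1.0);

    // `f64::max` returns the non-NaN argument, so NaN is silently dropped.
    assert_eq!(2.0_f64.max(pos_nan), 2.0);
    // Under total order, positive NaN sorts above every other value.
    assert!(total_order_max(2.0, pos_nan).is_nan());
    // Negative NaN sorts below every other value.
    assert_eq!(total_order_max(2.0, neg_nan), 2.0);

    // Sorting with total_cmp yields -NaN, -Inf, -x, -0, 0, x, Inf, NaN.
    let mut v = vec![
        pos_nan, 1.0, f64::INFINITY, -0.0, 0.0, -1.0, f64::NEG_INFINITY, neg_nan,
    ];
    v.sort_by(|a, b| a.total_cmp(b));
    println!("{:?}", v);

    println!(
        "prune with TYPE_DEFINED_ORDER? {}",
        stats_usable_for_pruning(ColumnOrderSketch::TypeDefinedOrder)
    );
}
```

The two `max` results differ for the same inputs, which is why statistics written under one convention cannot safely be used for pruning under the other.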