Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]

via GitHub Tue, 22 Apr 2025 09:59:38 -0700


etseidl commented on issue #15812:
URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2821942159


   > I'm not immediately sure. Is the point that the result of `max(2.0, NaN)` 
depends on how you define the ordering of floating point numbers wrt NaN, which 
has two variations?
   
   Yes. Different systems treat `NaN` differently, but IIUC datafusion uses 
total order for floating point comparison, so (loosely) `-NaN < -Inf < -x < -0 
< 0 < x < Inf < NaN`. This ordering is being proposed for Parquet as well.
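
To make the ordering concrete, here is a minimal Rust sketch using the standard library's `f64::total_cmp`, which implements the IEEE 754 total order. The helper names `total_order_max` and `max_ignoring_nan` are made up for illustration and are not DataFusion or parquet-rs APIs. The second helper only models a writer that skips `NaN` when computing stats.

```rust
use std::cmp::Ordering;

/// Max under IEEE 754 total order: a positive NaN sorts above +Inf.
/// (Hypothetical helper, not a DataFusion or parquet-rs function.)
fn total_order_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(|a, b| a.total_cmp(b))
}

/// Max that skips NaN, modelling a writer whose stats do not account for NaN.
/// (Hypothetical helper, for comparison only.)
fn max_ignoring_nan(values: &[f64]) -> Option<f64> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .max_by(|a, b| a.total_cmp(b))
}

fn main() {
    // Build NaNs with an explicit sign bit rather than relying on f64::NAN's sign.
    let pos_nan = f64::NAN.copysign(1.0);
    let neg_nan = f64::NAN.copysign(-1.0);

    let mut v = vec![pos_nan, 1.5, f64::INFINITY, -0.0, 0.0, f64::NEG_INFINITY, -1.5, neg_nan];
    v.sort_by(|a, b| a.total_cmp(b));

    // Expected shape: -NaN < -Inf < -1.5 < -0 < 0 < 1.5 < Inf < NaN
    assert!(v[0].is_nan() && v[0].is_sign_negative());
    assert_eq!(v[1], f64::NEG_INFINITY);
    assert!(v[3] == 0.0 && v[3].is_sign_negative());
    assert!(v[4] == 0.0 && v[4].is_sign_positive());
    assert_eq!(v[6], f64::INFINITY);
    assert!(v[7].is_nan() && v[7].is_sign_positive());
    assert_eq!(f64::INFINITY.total_cmp(&pos_nan), Ordering::Less);

    // The two definitions of "max" disagree as soon as NaN is present.
    let column = [1.0, 2.0, pos_nan];
    println!("total order max:  {:?}", total_order_max(&column));
    println!("NaN-ignoring max: {:?}", max_ignoring_nan(&column));
}
```

If the stored max is the `NaN`-ignoring `2.0`, a pruner evaluating something like `x > 5.0` would skip the container, even though under total order the `NaN` row compares greater than `5.0`.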
   
   > If so the simplest short term solution would be to not write stats for 
containers that have NaN. At least results would then be correct.
   
   Yes, and I believe that's what parquet-java might already do. But many 
writers do write stats in this case, which leads to the usual backwards 
compatibility issues. So in my mind the ultimate solution is to check 
`ColumnOrder` for the predicate column. If it uses the new 
`IEEE_754_TOTAL_ORDER` ordering, then proceed as usual. If it uses 
`TYPE_DEFINED_ORDER`, then we know `NaN` may be present but not accounted for, 
so don't do any pruning. The problem in my head is how to get the `ColumnOrder` 
info down into the plan generation code...or do we do extra plan rewriting at 
the parquet layer? I'm too new with datafusion to have a feel for where the 
correct place to handle this is.
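
A rough sketch of that decision, assuming a simplified stand-in for the column metadata. The `ColumnOrder` and `PhysicalKind` enums and `can_prune_with_min_max` below are hypothetical and defined locally. They are not the parquet-rs types, and `Ieee754TotalOrder` models the proposed ordering rather than anything that exists in the format today. Treating an undefined order as "don't prune" is also an assumption of this sketch.

```rust
/// Hypothetical, simplified stand-in for a column's ordering metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrder {
    /// Existing ordering: NaN may be present but not reflected in min/max.
    TypeDefinedOrder,
    /// Proposed ordering: stats follow IEEE 754 total order, NaN included.
    Ieee754TotalOrder,
    /// Ordering unknown to the reader.
    Undefined,
}

/// Hypothetical coarse classification of the predicate column's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalKind {
    Float,
    Other,
}

/// Decide whether min/max statistics are safe to use for pruning.
fn can_prune_with_min_max(kind: PhysicalKind, order: ColumnOrder) -> bool {
    match (kind, order) {
        // Assumption: stats with an unknown ordering are never trusted.
        (_, ColumnOrder::Undefined) => false,
        // NaN may be present but unaccounted for, so don't prune.
        (PhysicalKind::Float, ColumnOrder::TypeDefinedOrder) => false,
        // Stats were written with NaN accounted for, so proceed as usual.
        (PhysicalKind::Float, ColumnOrder::Ieee754TotalOrder) => true,
        // Non-float columns are unaffected by the NaN problem.
        (PhysicalKind::Other, _) => true,
    }
}

fn main() {
    for order in [
        ColumnOrder::TypeDefinedOrder,
        ColumnOrder::Ieee754TotalOrder,
        ColumnOrder::Undefined,
    ] {
        println!(
            "float column, {:?}: prune = {}",
            order,
            can_prune_with_min_max(PhysicalKind::Float, order)
        );
    }
}
```

This says nothing about where in DataFusion such a check would live, which is the open question above.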
   
   > How do we handle this with nulls 🤔
   
   Different can of worms 😄. I'm not sure what parquet-rs does with a column of 
`NaN` mixed with `null`. Guess I'll go see...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
