[GitHub] [parquet-format] tustvold commented on pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Fri, 30 Jun 2023 03:45:22 -0700


tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748


   > I wonder for PageIndex pruning in Rust implementations
   
   Currently the arrow-rs implementation orders floats using the totalOrder 
predicate defined by the IEEE 754 (2008 revision) floating point standard. This 
can be implemented very efficiently with some bit-twiddling, and it appears to 
be the standardised way to handle this. I believe DataFusion uses the same 
comparison kernels to evaluate pruning predicates, so I would expect it to 
behave similarly with regard to NaNs.
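   
   As a rough sketch of the kind of bit-twiddling involved (this is not the 
arrow-rs code, and the names `total_order_key` and `total_order_cmp` are 
hypothetical), an `f64` can be mapped to a signed integer whose ordinary integer 
order matches totalOrder:

```rust
use std::cmp::Ordering;

/// Hypothetical helper: maps an f64 to an i64 key whose integer ordering
/// matches the IEEE 754 totalOrder of the original floats.
fn total_order_key(x: f64) -> i64 {
    let bits = x.to_bits() as i64;
    // All ones in the low 63 bits if the sign bit is set, zero otherwise.
    let mask = (((bits >> 63) as u64) >> 1) as i64;
    // Negative floats: flip every bit except the sign bit, so that larger
    // magnitudes become smaller keys. Non-negative floats are left unchanged.
    bits ^ mask
}

/// Hypothetical helper: compares two f64 values under totalOrder.
fn total_order_cmp(a: f64, b: f64) -> Ordering {
    total_order_key(a).cmp(&total_order_key(b))
}
```

   With this, `total_order_cmp(-0.0, 0.0)` is `Ordering::Less`, and the result 
should agree with the standard library's `f64::total_cmp` for any pair of inputs.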
   
   From the [Rust 
docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):
   
   > The values are ordered in the following sequence:
   > 
   >     negative quiet NaN
   >     negative signaling NaN
   >     negative infinity
   >     negative numbers
   >     negative subnormal numbers
   >     negative zero
   >     positive zero
   >     positive subnormal numbers
   >     positive numbers
   >     positive infinity
   >     positive signaling NaN
   >     positive quiet NaN.
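   
   To illustrate the sequence above, here is a minimal sketch that sorts a 
vector with the standard library's `f32::total_cmp`. The function name 
`sort_total` is hypothetical and only there for the example:

```rust
/// Hypothetical helper: sorts floats using the totalOrder comparison
/// provided by the standard library (`f32::total_cmp`).
fn sort_total(mut values: Vec<f32>) -> Vec<f32> {
    // Unlike `partial_cmp`, `total_cmp` never fails on NaN, so no unwrap
    // or special-casing is needed.
    values.sort_by(|a, b| a.total_cmp(b));
    values
}
```

   Sorting a vector that mixes NaNs of both signs, infinities, and both zeros 
places the negative NaNs first and the positive NaNs last, with `-0.0` 
immediately before `0.0`.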
   
   > would it matter for adding [-inf, +inf] as min-max for all nan and null 
pages
   
   I haven't read the full backscroll, but the original PR's suggestion of just 
writing a NaN for a page only containing NaN seems perfectly logical to me, 
unlikely to cause compatibility issues, and significantly less surprising than 
writing a value that doesn't actually appear in the data...
   
   > Let's cc some of the maintainers of 
[parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):
   
   I don't really know enough about the history of floating point comparison to 
weigh in on what the best solution is with any degree of authority. However, my 
2 cents is that the totalOrder predicate is the standardised way to handle this.
   
   Whilst I do agree that the behaviour of aggregate statistics containing NaNs 
might be unfortunate for some workloads, I'm not sure that special-casing them 
is beneficial. Aside from the non-trivial additional complexity involved, if 
you don't include NaNs in statistics it is unclear to me how you can push down 
a comparison predicate, as you have no way to know whether the page contains 
NaNs. Perhaps that is what this PR seeks to address, but I do wonder if the 
simple solution might be worth considering...
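   
   A minimal sketch of that simple solution, assuming min/max are computed under 
totalOrder so that NaNs take part like any other value. `PageStats`, 
`page_stats`, and `may_contain_greater` are hypothetical names for illustration, 
not parquet-rs or DataFusion APIs:

```rust
use std::cmp::Ordering;

/// Hypothetical page statistics; not the parquet-rs or DataFusion types.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: f64,
    max: f64,
}

/// Computes min/max under totalOrder, so a NaN can end up as min or max.
/// Returns None for an empty page.
fn page_stats(values: &[f64]) -> Option<PageStats> {
    let first = *values.first()?;
    let mut stats = PageStats { min: first, max: first };
    for &v in &values[1..] {
        if v.total_cmp(&stats.min) == Ordering::Less {
            stats.min = v;
        }
        if v.total_cmp(&stats.max) == Ordering::Greater {
            stats.max = v;
        }
    }
    Some(stats)
}

/// Returns true if the page may hold a value greater than `threshold`
/// under totalOrder; a false result means the page can be skipped.
fn may_contain_greater(stats: &PageStats, threshold: f64) -> bool {
    stats.max.total_cmp(&threshold) == Ordering::Greater
}
```

   Under these assumptions, a page whose max is a positive NaN is never pruned 
for a "greater than" predicate, while a page whose max is an ordinary number is 
known to contain no positive NaN and can be pruned on that max alone.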
   
   Also tagging @crepererum who may have further thoughts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
