tustvold commented on PR #196: URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748
> I wonder for PageIndex pruning in Rust implementations

Currently the arrow-rs implementation uses the totalOrder predicate, as defined by the IEEE 754 (2008 revision) floating point standard, to order floats. This can be implemented very efficiently using some bit-twiddling, and it at least appears to be the standardised way to handle this. I believe DataFusion is using these same comparison kernels for evaluating pruning predicates, and so I would expect it to have similar behaviour with regard to NaNs.

From the [Rust docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):

> The values are ordered in the following sequence:
>
> - negative quiet NaN
> - negative signaling NaN
> - negative infinity
> - negative numbers
> - negative subnormal numbers
> - negative zero
> - positive zero
> - positive subnormal numbers
> - positive numbers
> - positive infinity
> - positive signaling NaN
> - positive quiet NaN.

> would it matter for adding [-inf, +inf] as min-max for all nan and null pages

I haven't read the full backscroll, but the original PR's suggestion of just writing a NaN for a page only containing NaN seems perfectly logical to me, unlikely to cause compatibility issues, and significantly less surprising than writing a value that doesn't actually appear in the data...

> Let's cc some of the maintainers of [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):

I don't really know enough about the history of floating point comparison to weigh in on what the best solution is with any degree of authority. However, my 2 cents is that the totalOrder predicate is the standardised way to handle this. Whilst I do agree that the behaviour of aggregate statistics containing NaNs might be unfortunate for some workloads, I'm not sure that special casing them is beneficial.
Aside from the non-trivial additional complexity associated with special-casing them, if you don't include NaNs in statistics it is unclear to me how you can push down a comparison predicate as you have no way to know if the page contains NaNs? Perhaps that is what this PR seeks to address, but I do wonder if the simple solution might be worth considering...

Also tagging @crepererum who may have further thoughts
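
To make the totalOrder point above concrete, here is a minimal sketch in Rust. It is not the arrow-rs or DataFusion code: `total_order_key`, `page_min_max`, and `page_may_match_gt` are hypothetical names invented for illustration. The key function uses the same bit trick that the standard library documents for `f32::total_cmp`. The other two show what page statistics and a `x > threshold` pruning check could look like if NaNs are simply included under total order.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: map an f32 to an i32 whose ordinary signed integer
/// order matches the IEEE 754 totalOrder of the original float.
fn total_order_key(x: f32) -> i32 {
    let bits = x.to_bits() as i32;
    // `bits >> 31` is all ones for a set sign bit and zero otherwise.
    // Shifting that mask right by one (as u32) clears the top bit, so the XOR
    // flips every bit except the sign bit, and only for negative values.
    // Negative floats with larger magnitude then get smaller keys.
    bits ^ ((((bits >> 31) as u32) >> 1) as i32)
}

/// Hypothetical helper: min and max of a page under total order.
/// NaNs are not special-cased, so a page holding only NaN yields NaN bounds.
fn page_min_max(values: &[f32]) -> Option<(f32, f32)> {
    let min = values.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    Some((min, max))
}

/// Hypothetical pruning check for a predicate `x > threshold`, evaluated
/// against the page maximum under total order. A positive NaN maximum sorts
/// above every number, so such a page is never skipped by this check.
fn page_may_match_gt(max: f32, threshold: f32) -> bool {
    max.total_cmp(&threshold) == Ordering::Greater
}
```

Sorting a slice with `values.sort_by(|a, b| a.total_cmp(b))` gives the sequence quoted from the Rust docs, and `total_order_key` yields the same order through plain integer comparison. The trade-off is the one mentioned above: a page containing a positive NaN has a NaN maximum, so a `>` predicate cannot prune it, but the reader at least knows the NaN is there.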
