[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739025#comment-17739025 ]
ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748

> I wonder for PageIndex pruning in Rust implementions

Currently the arrow-rs implementation uses the totalOrder predicate, as defined by the IEEE 754 (2008 revision) floating point standard, to order floats. This can be implemented very efficiently with some bit-twiddling, and it appears to be the standardised way to handle this. I believe DataFusion uses these same comparison kernels for evaluating pruning predicates, so I would expect it to behave the same way with regard to NaNs.

From the [Rust docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):

> The values are ordered in the following sequence:
>
> negative quiet NaN
> negative signaling NaN
> negative infinity
> negative numbers
> negative subnormal numbers
> negative zero
> positive zero
> positive subnormal numbers
> positive numbers
> positive infinity
> positive signaling NaN
> positive quiet NaN

> would it matter for adding [-inf, +inf] as min-max for all nan and null pages

I haven't read the full backscroll, but the original PR's suggestion of simply writing NaN for a page containing only NaNs seems perfectly logical to me, unlikely to cause compatibility issues, and significantly less surprising than writing a value that doesn't actually appear in the data...

> Let's cc some of the maintainers of [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):

I don't really know enough about the history of floating point comparison to weigh in on the best solution with any degree of authority, but my 2 cents is that the totalOrder predicate is the standardised way to handle this. While I do agree that the behaviour of aggregate statistics containing NaNs might be unfortunate for some workloads, I'm not sure that special-casing them is beneficial.
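To illustrate the bit-twiddling mentioned above, here is a minimal sketch of how a totalOrder comparison for `f32` can be done on the raw bits. It mirrors the trick used by the standard library's `f32::total_cmp`; it is not the actual arrow-rs source, just an illustration of the technique:

```rust
use std::cmp::Ordering;

// Sketch of IEEE 754 totalOrder for f32 via bit manipulation.
fn total_order_f32(a: f32, b: f32) -> Ordering {
    // Reinterpret the raw IEEE 754 bits as signed integers.
    let mut ia = a.to_bits() as i32;
    let mut ib = b.to_bits() as i32;
    // Floats use sign-magnitude encoding, so negative values sort backwards
    // when compared as integers. Flipping the magnitude bits of negative
    // values yields a monotonic two's-complement ordering that matches
    // totalOrder, including the NaN placement quoted above.
    ia ^= (((ia >> 31) as u32) >> 1) as i32;
    ib ^= (((ib >> 31) as u32) >> 1) as i32;
    ia.cmp(&ib)
}

fn main() {
    // NaNs sort outside the infinities under totalOrder.
    assert_eq!(total_order_f32(f32::NAN, f32::INFINITY), Ordering::Greater);
    assert_eq!(total_order_f32(-f32::NAN, f32::NEG_INFINITY), Ordering::Less);
    // -0.0 sorts strictly before +0.0.
    assert_eq!(total_order_f32(-0.0, 0.0), Ordering::Less);
    // Agrees with the standard library's total_cmp.
    assert_eq!(total_order_f32(1.5, -2.5), 1.5f32.total_cmp(&-2.5));
}
```

Two integer comparisons of pre-transformed keys are branch-free and cheap, which is why this ordering is attractive for statistics and pruning kernels.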
Aside from the non-trivial additional complexity of special-casing them, if you don't include NaNs in statistics it is unclear to me how you can push down a comparison predicate, as you have no way to know whether the page contains NaNs. Perhaps that is what this PR seeks to address, but I do wonder if the simple solution might be worth considering...

Also tagging @crepererum, who may have further thoughts.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is inconsistent, leading to cases where it is impossible to create a parquet file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. The spec mentions that NaN values should not be included in min/max bounds, so a page consisting of only NaN values has no defined min/max bound. To quote the spec:
> {noformat}
>  * When writing statistics the following rules should be followed:
>  * - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem.
> The page definitely does *not* contain only null values, so {{null_pages}} should be {{false}} for this page. However, in this case the spec requires valid min/max values in {{min_values}} and {{max_values}} for this page. As the only value in the page is NaN, the only valid min/max value we could enter here is NaN, but as mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this specification as soon as there is an only-NaN column and column indexes are to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its null_pages entry set to *true*.
> 2. A page consisting only of NaNs (or a mixture of NaNs and nulls) has {{byte[0]}} as min/max, even though the null_pages entry is set to *false*.
> 3. A page consisting only of NaNs (or a mixture of NaNs and nulls) has NaN as min & max in the column index.
> None of the solutions is perfect, but I guess solution 3 is the best of them. It gives us valid min/max bounds, makes null_pages consistent with this, and gives us a way to detect only-NaN pages (min=max=NaN).
> As a general note: I would say it is a shortcoming that Parquet doesn't track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't have this inconsistency. NaN counts could be introduced in a future version, but that doesn't help with backward compatibility, so we do need a solution for now.
> Any of the solutions is better than the current situation, where engines writing such a page cannot produce a conforming parquet file and will arbitrarily pick one of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max bounds by adding a clause stating that "if a page contains only NaNs or a mixture of NaNs and NULLs, then NaN should be written as min & max".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
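To make solution 3 from the issue above concrete, here is a hypothetical writer-side sketch of how a per-page column-index entry could be computed. The function name, signature, and `Option<f64>`-as-nullable-value representation are my own for illustration, not code from any real Parquet writer:

```rust
/// Compute a column-index entry (null_pages flag, min, max) for one page
/// of a nullable f64 column, following solution 3: a page of only NaNs
/// (or NaNs mixed with nulls) gets NaN as both min and max.
fn page_index_entry(page: &[Option<f64>]) -> (bool, Option<f64>, Option<f64>) {
    // Non-null values with NaNs filtered out, per the existing spec rule
    // that NaNs are excluded from min/max statistics.
    let non_nan: Vec<f64> = page
        .iter()
        .filter_map(|v| *v)
        .filter(|v| !v.is_nan())
        .collect();
    if page.iter().all(|v| v.is_none()) {
        // Only nulls: null_pages = true, min/max stay empty (byte[0]).
        (true, None, None)
    } else if non_nan.is_empty() {
        // Only NaNs, or NaNs mixed with nulls: solution 3 writes NaN as
        // both min and max, making min=max=NaN the only-NaN marker.
        (false, Some(f64::NAN), Some(f64::NAN))
    } else {
        // Ordinary page: min/max over the non-NaN values.
        let min = non_nan.iter().copied().fold(f64::INFINITY, f64::min);
        let max = non_nan.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        (false, Some(min), Some(max))
    }
}

fn main() {
    // Null-only page.
    assert_eq!(page_index_entry(&[None, None]), (true, None, None));
    // NaN-only page: not a null page, min=max=NaN.
    let (null_page, min, max) = page_index_entry(&[Some(f64::NAN), None]);
    assert!(!null_page && min.unwrap().is_nan() && max.unwrap().is_nan());
    // Ordinary page: NaNs are skipped when computing the bounds.
    assert_eq!(
        page_index_entry(&[Some(3.0), Some(f64::NAN), Some(1.0)]),
        (false, Some(1.0), Some(3.0))
    );
}
```

Note how the three branches line up with the three cases the issue distinguishes, and how `min=max=NaN` doubles as the detector for only-NaN pages that the issue mentions.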