wgtmac commented on PR #2971: URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2796402724
Thanks for the review @mkaravel @paleolimbot @jiayuasu! Let me clarify some things about Parquet statistics first: - Parquet does not consider `NaN` when collecting min/max values for float/double values. So by no means we can know if any NaN value exists simply for the column statistics. - There is [an ongoing effort](https://github.com/apache/parquet-format/pull/221 ) to fix the above issue by introducing a total order: `-Inf < -NaN < Negative < -0.0 < +0.0 < Positive < NaN < +Inf` - Null values do not interfere with min/max values in the column statistics. If all values are null, we can deduce it by checking `null_count == value_count` to ignore the empty min/max values. Based on the above facts, I think we don't need to consider null values for geometry values. We still need to take care of NaN and invalid values. Then I have some questions: - What is an invalid value in a geometry feature? `NaN`? `+/-Inf`? Anything else? - From the above discussion, it seems that `+/-Inf` are invalid values in terms of a bbox. If that's true, definitely we should not make it as the final bbox to persist in the file. Is `NaN` a valid value in a bbox? Is it a good way to use `NaN` values for an empty bbox? - Is it a good approach to drop the entire bbox if any NaN or invalid value appears? In this way, we do not fail the writer at the cost of missing bbox. I'm in favor of this so we do not produce any confusing stats to users. It is really hard to downstream users to decide if the provided stats are reliable for predicate push down. I think @mkaravel's proposal (pasted as below) looks reasonable: ``` - Initialize the box to (inf, -inf, inf, -inf). - Compute the bounding box of the current geometry. - If any of the box's coordinates is NaN ignore that box. Otherwise merge it with the existing box for the group. - If the box is still (inf, -inf, inf, -inf) output (NaN, NaN, NaN, NaN) for the group. Otherwise, output the box we have already computed. ``` For the last step, I think it is better to drop the bbox instead of writing all `NaN`s to confuse users. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
