paleolimbot commented on PR #2971: URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2764725793
Thank you for the review! > If at the end of processing a row group you still see a bbox of (inf, -inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have not found any non-empty geometry... Probably the main issue here is that in the Thrift `BoundingBox`, only `zmin/max` and `mmin/max` are `optional`, and I forgot when reviewing that about the 100% EMPTY case for x and y. The entire `BoundingBox` is technically `optional` in Thrift, but we use an "unset" `BoundingBox` to definitively mean that there were no x/y values it needs to be very clear in the spec. Either way we should add a clarification that unset `zmin/max` and `mmin/max` can only occur if there were no non-NaN Z or M values encountered (if that is indeed what we meant by that). If `NaN`s are the convention we land on (instead of unset BoundingBox), it's critical that that `NaN`s are not generated for any other case (for example, it's critical to ensure that a column chunk containing the invalid geometry `LINESTRING (1 2, nan nan, 3 4)`, does *not* expose NaN, NaN, NaN, NaN as its x--y statistics, because that would result in a predicate pushdown skipping the entire array which may contain other values that do satisfy the predicate). If my reading of the current state of this PR is correct, this still needs to be fixed here. I personally like exposing inf, -inf, inf, -inf better than NaN (currently what C++ does) because it is impossible (I think) to generate this by accident and "just works" with the formula provided in the Parquet format for interval containment (we only provided the formula for `x`, but it's reasonable for somebody to read it and apply the non-wraparound version to other dimensions). > Could you please share a link to the C++ implementation? Would be very interested in taking a look. The PR is here: https://github.com/apache/arrow/pull/45459 ...and the files are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments welcome! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
