paleolimbot commented on PR #2971:
URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2764725793

   Thank you for the review!
   
   > If at the end of processing a row group you still see a bbox of (inf, 
-inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have 
not found any non-empty geometry...
   
   Probably the main issue here is that in the Thrift `BoundingBox`, only 
`zmin/max` and `mmin/max` are `optional`, and I forgot when reviewing that 
about the 100% EMPTY case for x and y. The entire `BoundingBox` is technically 
`optional` in Thrift, but we use an "unset" `BoundingBox` to definitively mean 
that there were no x/y values it needs to be very clear in the spec. Either way 
we should add a clarification that unset `zmin/max` and `mmin/max` can only 
occur if there were no non-NaN Z or M values encountered (if that is indeed 
what we meant by that).
   
   If `NaN`s are the convention we land on (instead of unset BoundingBox), it's 
critical that that `NaN`s are not generated for any other case (for example, 
it's critical to ensure that a column chunk containing the invalid geometry 
`LINESTRING (1 2, nan nan, 3 4)`, does *not* expose NaN, NaN, NaN, NaN as its 
x--y statistics, because that would result in a predicate pushdown skipping the 
entire array which may contain other values that do satisfy the predicate). If 
my reading of the current state of this PR is correct, this still needs to be 
fixed here.
   
   I personally like exposing inf, -inf, inf, -inf better than NaN (currently 
what C++ does) because it is impossible (I think) to generate this by accident 
and "just works" with the formula provided in the Parquet format for interval 
containment (we only provided the formula for `x`, but it's reasonable for 
somebody to read it and apply the non-wraparound version to other dimensions).
   
   > Could you please share a link to the C++ implementation? Would be very 
interested in taking a look.
   
   The PR is here: https://github.com/apache/arrow/pull/45459 ...and the files 
are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments 
welcome!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to