mkaravel commented on PR #2971:
URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2778773392

   > Thank you for the review!
   > 
   > > If at the end of processing a row group you still see a bbox of (inf, 
-inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have 
not found any non-empty geometry...
   > 
   > Probably the main issue here is that in the Thrift `BoundingBox`, only 
`zmin/max` and `mmin/max` are `optional`, and I forgot when reviewing that 
about the 100% EMPTY case for x and y. The entire `BoundingBox` is technically 
`optional` in Thrift, but we use an "unset" `BoundingBox` to definitively mean 
that there were no x/y values it needs to be very clear in the spec. Either way 
we should add a clarification that unset `zmin/max` and `mmin/max` can only 
occur if there were no non-NaN Z or M values encountered (if that is indeed 
what we meant by that).
   > 
   > If `NaN`s are the convention we land on (instead of unset BoundingBox), 
it's critical that that `NaN`s are not generated for any other case (for 
example, it's critical to ensure that a column chunk containing the invalid 
geometry `LINESTRING (1 2, nan nan, 3 4)`, does _not_ expose NaN, NaN, NaN, NaN 
as its x--y statistics, because that would result in a predicate pushdown 
skipping the entire array which may contain other values that do satisfy the 
predicate). If my reading of the current state of this PR is correct, this 
still needs to be fixed here.
   > 
   > I personally like exposing inf, -inf, inf, -inf better than NaN (currently 
what C++ does) because it is impossible (I think) to generate this by accident 
and "just works" with the formula provided in the Parquet format for interval 
containment (we only provided the formula for `x`, but it's reasonable for 
somebody to read it and apply the non-wraparound version to other dimensions).
   > 
   > > Could you please share a link to the C++ implementation? Would be very 
interested in taking a look.
   > 
   > The PR is here: 
[apache/arrow#45459](https://github.com/apache/arrow/pull/45459) ...and the 
files are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments 
welcome!
   
   @paleolimbot Thank you for the detailed answer. You are touching upon a few 
important things, I think we need to figure them out.
   
   First at a very high level: I think we should update the Parquet spec (and 
Iceberg I would say) to explicitly define the behavior in the edge case where 
the column has all its geometries empty. I also think we should first define 
the behavior we want and then move on with the implementation rather than try 
to do it the other way around.
   
   I can see the following interesting cases/considerations which are of course 
can be combined (and thank you for bringing up the issue with geometries with 
invalid coordinates):
   * NULL values.
   * Empty geometries.
   * Geometries with invalid coordinates.
   
   My understanding is that for the purposes of statistics computation we want 
to skip all such rows. Another option would be to fail reading or writing when 
we have geometries with invalid coordinates, but I would like to hear opinions 
about this direction which is quite invasive. So for now I would assume that we 
do not want to fail in such cases, but rather compute the statistics in a safe 
manner.
   
   So the next question is how do we compute bounding boxes in the presence of 
such geometries (empty and invalid). Regarding empty geometries I proposed to 
use NaNs for the box values because NaNs are also typically used to represent 
the coordinates of empty points (JTS does that). It made more sense to me to 
have this "symmetry" (for example think of extracting the west-south corner of 
a box as a point: if you use NaNs you immediately get an empty point which 
makes sense to be conceptually). This is why I proposed to use NaNs for the 
case of entire column of empty geometries.
   
   As you very well pointed out, this leaves us with a question about how to 
deal with geometries with invalid coordinates (the question is even more 
interesting for geographies, but let's leave this for another 
thread/discussion). I can see two kinds of invalid coordinates: NaNs and 
infinity values. Everything else (for geometries) should be considered as valid.
   
   Let's assume now that we have a bounding box computation algorithm that will 
just compare coordinates using any double comparison operator except `!=` (I 
believe the JTS implementation does that). I am also assuming that a bounding 
box computation algorithm will also skip empty point, empty points in 
multipoints, and empty points inside geometry collections, appearing either as 
points or points inside a multipoint (I believe the JTS implementation also 
does that). In this case any NaN coordinates in the input will surface as NaNs 
in the resulting box. Inf coordinates may or may not surface up. In the end for 
a given geometry we end up with a bounding box that possibly has NaN and inf 
coordinates. By thinking is that we can ignore geometries (for the row group 
bounding box computation) that have at least one NaN coordinate in their box 
and keep the inf coordinates as they are. So basically, the flow for computing 
the bounding box for a group of geometries would be like:
   * Initialize the box to `(inf, -inf, inf, -inf)`.
   * Compute the bounding box of the current geometry.
   * If any of the box's coordinates is NaN ignore that box. Otherwise merge it 
with the existing box for the group.
   * If the box is still `(inf, -inf, inf, -inf)` output `(NaN, NaN, NaN, NaN)` 
for the group. Otherwise, output the box we have already computed.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to