wgtmac commented on PR #494:
URL: https://github.com/apache/parquet-format/pull/494#issuecomment-2814400884

   > Is this what we really want? I would expect that if I see a min/max value 
in the geo statistics that I would be able to safely skip the rows.
   
   @mkaravel If writers are instructed to drop the bbox whenever there is any 
NaN value, how would you skip the rows if there are some (but not all) NaN 
values in the bbox? Shouldn't we consider it malformed?
   
   > If there are non-NULL values which are all either empty or "invalid" then 
write the empty bounding box.
   
   Ah, I got your point. In total there are five cases:
   
   - (1) All non-null geometry features are valid: produce bbox as usual.
   - (2) Some non-null geometry features are valid but other non-null values 
are not: produce bbox with only valid data. (Do we need an extra field to 
indicate there are invalid data?)
   - (3) All non-null geometry features are invalid: we need an empty bbox or 
yet another field to indicate there are no valid data?
   - (4) All values are null: it can be deduced from 
`ColumnMetaData::num_values == ColumnMetaData::statistics.null_count`.
   - (5) No values (a.k.a. an empty row group with 0 rows): it can be deduced 
from `RowGroup.num_rows == 0`.
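
   As a rough sketch of how a reader might tell these five cases apart (in 
Python, purely for illustration): `num_rows`, `num_values` and `null_count` 
mirror the existing metadata fields mentioned above, while `valid_count` is a 
hypothetical input that no current field provides. That gap is exactly why 
cases 1 to 3 cannot be distinguished today.

```python
from enum import Enum


class ChunkCase(Enum):
    ALL_VALID = 1      # case 1: every non-null geometry is valid
    SOME_INVALID = 2   # case 2: mix of valid and invalid non-null geometries
    ALL_INVALID = 3    # case 3: every non-null geometry is invalid
    ALL_NULL = 4       # case 4: num_values == null_count
    NO_VALUES = 5      # case 5: num_rows == 0


def classify_chunk(num_rows: int, num_values: int, null_count: int,
                   valid_count: int) -> ChunkCase:
    """Classify a column chunk into one of the five cases.

    valid_count is a HYPOTHETICAL number of non-null, valid geometries;
    it is not part of the Parquet metadata today.
    """
    if num_rows == 0:
        return ChunkCase.NO_VALUES
    if num_values == null_count:
        return ChunkCase.ALL_NULL
    non_null = num_values - null_count
    if valid_count == non_null:
        return ChunkCase.ALL_VALID
    if valid_count == 0:
        return ChunkCase.ALL_INVALID
    return ChunkCase.SOME_INVALID
```

   Cases 4 and 5 fall out of metadata that already exists; only the last three 
branches depend on the assumed `valid_count`.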
   
   It seems that we need a clear way to indicate that there are no valid data 
(case 3 above). I'm hesitant to use a bbox of all NaNs to represent an empty 
bbox because it complicates the logic for cases where some values are NaN 
while others are not. The proposal from @paleolimbot to introduce 
`dimensions_that_have_zero_non_nan_values` might also complicate things, 
because we would need extra checks and might also have to introduce separate 
fields for the Z and M axes.
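
   To illustrate the NaN concern, here is a small Python sketch (not taken from 
any Parquet implementation; all function names are made up). Because every 
ordered comparison with NaN is false, two logically equivalent overlap tests 
disagree on a partially-NaN bbox, so a reader needs an explicit NaN check. 
Treating such a bbox as "cannot skip" is only one possible policy and is an 
assumption here.

```python
import math

NAN = float("nan")


def overlaps_conjunction(bbox, query):
    """Overlap test written as a conjunction of inclusive comparisons."""
    xmin, ymin, xmax, ymax = bbox
    qxmin, qymin, qxmax, qymax = query
    return xmin <= qxmax and xmax >= qxmin and ymin <= qymax and ymax >= qymin


def overlaps_not_disjoint(bbox, query):
    """The same test written as the negation of 'the boxes are disjoint'."""
    xmin, ymin, xmax, ymax = bbox
    qxmin, qymin, qxmax, qymax = query
    return not (xmin > qxmax or xmax < qxmin or ymin > qymax or ymax < qymin)


def overlaps_nan_aware(bbox, query):
    """Assumed policy: a bbox containing any NaN can never justify skipping."""
    if any(math.isnan(v) for v in bbox):
        return True
    return overlaps_conjunction(bbox, query)


# A bbox whose X bounds are NaN but whose Y bounds are regular numbers.
partial_nan_bbox = (NAN, 0.0, NAN, 10.0)
query = (0.0, 0.0, 5.0, 5.0)

# The two formulations disagree: one would skip the rows, the other would not.
skip_a = not overlaps_conjunction(partial_nan_bbox, query)
skip_b = not overlaps_not_disjoint(partial_nan_bbox, query)
```

   With NaN-free bounds both formulations agree, so the disagreement only 
shows up once NaN is allowed into the statistics.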
   
   Is this a common case in which the invalid data should be preserved as is? 
Or can we simplify this by writing null for invalid data to Parquet, so that 
we can eliminate cases 2 and 3 above?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

