wgtmac commented on PR #494: URL: https://github.com/apache/parquet-format/pull/494#issuecomment-2814400884
> Is this what we really want? I would expect that if I see a min/max value in the geo statistics that I would be able to safely skip the rows.

@mkaravel If writers are instructed to drop the bbox if there is any NaN value, how would you skip the rows if there are some (but not all) NaN values in the bbox? Shouldn't we consider it malformed?

> If there are non-NULL values which are all either empty or "invalid" then write the empty bounding box.

Ah, I got your point. In total there are five cases:

- (1) All non-null geometry features are valid: produce the bbox as usual.
- (2) Some non-null geometry features are valid but other non-null values are not: produce the bbox from the valid data only. (Do we need an extra field to indicate that there is invalid data?)
- (3) All non-null geometry features are invalid: do we need an empty bbox, or yet another field, to indicate that there is no valid data?
- (4) All values are null: this can be deduced from `ColumnMetaData::num_values == ColumnMetaData::statistics.null_count`.
- (5) No values (a.k.a. an empty row group with 0 rows): this can be deduced from `RowGroup.num_rows == 0`.

It seems that we need a clear approach to indicate that there is no valid data (case 3 above). I'm hesitant to use a bbox of all NaNs to represent an empty bbox because it complicates the logic for cases where some values are NaN while others are not. The proposal from @paleolimbot to introduce `dimensions_that_have_zero_non_nan_values` might also complicate things, because we would need extra checks and might have to introduce separate fields for the Z and M axes.

Is this a common case where the invalid data should be preserved as is? Or can we simplify this by writing null for invalid data to Parquet, so that we can eliminate cases 2 and 3 above?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
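For readers following the thread, the five cases listed in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Parquet specification or any writer's actual behavior: each geometry is modeled as a single `(x, y)` point, "invalid" is assumed to mean a non-finite coordinate (NaN or infinity), and the names `BBoxCase`, `is_valid`, and `classify` are hypothetical.

```python
import math
from enum import Enum
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]               # simplified stand-in for a geometry
BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class BBoxCase(Enum):
    ALL_VALID = 1     # case 1: every non-null value is valid
    SOME_INVALID = 2  # case 2: valid and invalid non-null values are mixed
    ALL_INVALID = 3   # case 3: every non-null value is invalid
    ALL_NULL = 4      # case 4: num_values == null_count
    NO_VALUES = 5     # case 5: num_rows == 0


def is_valid(point: Point) -> bool:
    # Assumption: a value is "invalid" if any coordinate is NaN or infinite.
    return all(math.isfinite(c) for c in point)


def classify(values: Sequence[Optional[Point]]) -> Tuple[BBoxCase, Optional[BBox]]:
    """Return the case number and the bbox built from valid values only."""
    if len(values) == 0:
        return BBoxCase.NO_VALUES, None
    non_null = [v for v in values if v is not None]
    if not non_null:
        return BBoxCase.ALL_NULL, None
    valid = [p for p in non_null if is_valid(p)]
    if not valid:
        # Case 3: nothing to build a bbox from; how to signal this is the open question.
        return BBoxCase.ALL_INVALID, None
    xs = [p[0] for p in valid]
    ys = [p[1] for p in valid]
    bbox = (min(xs), min(ys), max(xs), max(ys))
    case = BBoxCase.ALL_VALID if len(valid) == len(non_null) else BBoxCase.SOME_INVALID
    return case, bbox
```

In this sketch, cases 3, 4, and 5 all return no bbox, which is exactly why the comment asks for a separate signal: a reader cannot tell case 3 apart from a writer that simply omitted the statistics.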
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
