Re: [PR] GH-493: Clarify Bounding Box Behavior in GeospatialStatistics [parquet-format]

via GitHub Fri, 18 Apr 2025 09:12:20 -0700


paleolimbot commented on PR #494:
URL: https://github.com/apache/parquet-format/pull/494#issuecomment-2815592653


   > I am hesitant modifying user data on-write. Is there a precedent in 
Parquet where something like that happens? I mean data that is invalid being 
written as null or omitted?
   
   If this is a reference to
   
   > Is this a common case and the invalid data should be preserved as is? Or 
can we simplify this by writing null for invalid data to Parquet so we can 
eliminate case 2 and 3 above?
   
   Assuming that "invalid" means EMPTY or something with NaNs in it, I view 
this as something that a higher level wrapper (e.g., Sedona, Spark, or 
GeoPandas) might expose as an option to allow future readers of the file to 
have more useful statistics, and not something we would handle in any Parquet 
writer. Some tools generate EMPTYs, some tools generate nulls, and some 
computations may accidentally put a NaN in a value. It's Parquet's job (in my 
opinion) to transport all of those cases faithfully to a tool that can do some 
validation and fix things if needed based on user input. (I think that's where 
we are right now!)
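
To make the "higher-level wrapper option" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `is_invalid`, `prepare_for_write`, the `invalid_to_null` flag, and the toy geometry representation (a sequence of `(x, y)` tuples, with an empty sequence standing in for EMPTY) are illustrative assumptions, not part of Parquet, Sedona, Spark, or GeoPandas. The point is only that the rewrite is opt-in and happens before the values reach the writer.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Toy representation (an assumption for this sketch only):
# None = null, empty sequence = EMPTY, otherwise a sequence of (x, y) points.
Point = Tuple[float, float]
Geometry = Optional[Sequence[Point]]


def is_invalid(geom: Geometry) -> bool:
    """Return True for EMPTY geometries or geometries with a NaN coordinate."""
    if geom is None:
        return False  # null is already a well-defined "missing" value
    if len(geom) == 0:
        return True  # EMPTY
    return any(math.isnan(x) or math.isnan(y) for x, y in geom)


def prepare_for_write(
    geoms: Sequence[Geometry], invalid_to_null: bool = False
) -> List[Geometry]:
    """Hypothetical wrapper-level hook that runs before handing data to a writer.

    By default every value passes through untouched, so the file transports
    exactly what the user produced. Only when the user opts in are EMPTY and
    NaN-containing geometries replaced with null.
    """
    if not invalid_to_null:
        return list(geoms)
    return [None if is_invalid(g) else g for g in geoms]


column = [[(0.0, 1.0)], [], [(float("nan"), 2.0)], None]
faithful = prepare_for_write(column)                       # unchanged values
cleaned = prepare_for_write(column, invalid_to_null=True)  # opt-in rewrite
```

With the flag off, the writer sees the column exactly as given; with it on, only the first geometry survives as non-null, which is the trade-off a wrapper would expose to the user rather than a Parquet writer making it silently.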


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


