paleolimbot commented on PR #494: URL: https://github.com/apache/parquet-format/pull/494#issuecomment-2815592653
> I am hesitant modifying user data on-write. Is there a precedent in Parquet where something like that happens? I mean data that is invalid being written as null or omitted?

If this is a reference to

> Is this a common case and the invalid data should be preserved as is? Or can we simplify this by writing null for invalid data to Parquet so we can eliminate case 2 and 3 above?

then, assuming that "invalid" means EMPTY or something with NaNs in it, I view this as something that a higher-level wrapper (e.g., Sedona, Spark, or GeoPandas) might expose as an option to allow future readers of the file to have more useful statistics, and not something we would handle in any Parquet writer. Some tools generate EMPTYs, some tools generate nulls, and some computations may accidentally put a NaN in a value. It's Parquet's job (in my opinion) to transport all of those cases faithfully to a tool that can do some validation and fix things if needed based on user input. (I think that's where we are right now!)
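
As a rough illustration of what such a wrapper-level option could look like, here is a minimal sketch. Everything in it is hypothetical: the names `is_invalid` and `normalize_for_write`, the `invalid_as_null` flag, and the toy geometry model (a list of `(x, y)` tuples, with an empty list standing in for EMPTY and `None` for null) are assumptions made for the example and do not correspond to any existing Sedona, Spark, GeoPandas, or Parquet API.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical toy model: None is null, an empty sequence is EMPTY.
Coord = Tuple[float, float]
Geometry = Optional[Sequence[Coord]]


def is_invalid(geom: Geometry) -> bool:
    """Return True for EMPTY geometries or geometries with a NaN ordinate."""
    if geom is None:
        return False  # already null; nothing to normalize
    if len(geom) == 0:
        return True  # EMPTY
    return any(math.isnan(value) for coord in geom for value in coord)


def normalize_for_write(
    geoms: Sequence[Geometry], invalid_as_null: bool = False
) -> List[Geometry]:
    """Optionally replace invalid geometries with null before writing.

    With the default (invalid_as_null=False) every value is passed through
    unchanged, which mirrors a writer that transports the data faithfully.
    """
    if not invalid_as_null:
        return list(geoms)
    return [None if is_invalid(geom) else geom for geom in geoms]


column = [[(0.0, 1.0)], [], None, [(float("nan"), 2.0)]]
cleaned = normalize_for_write(column, invalid_as_null=True)
print(cleaned)  # [[(0.0, 1.0)], None, None, None]
```

The point of keeping this opt-in and outside the writer is that the default path leaves EMPTYs, nulls, and NaNs exactly as the user supplied them.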
