[GitHub] [parquet-format] crepererum commented on pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Fri, 07 Jul 2023 03:43:09 -0700


crepererum commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625220272


   I agree with @tustvold's standpoint. Some thoughts on top of what he wrote:
   
   IMHO this is leaking application details into the storage format. If you 
start to differentiate NaN from "all normal values" and NULL, you could just as 
well do the same for +/-Inf, because they also act as poison values in most 
computations. You could then do it for "nearly Inf" too, because someone 
divided by "nearly zero" and these super big values are equally nonsensical. 
This whole discussion isn't even specific to floats. Why do boolean stats not 
count true/false separately? What about empty strings and byte arrays? Or empty 
lists in general? My point is: this is opening a can of worms and the 
complexity isn't worth the gain.
   
   The better alternative is: let users cast invalid values to NULL if they 
want to exclude them from their data, because this is exactly what missing 
values were invented for. If they still want to store broken data and want 
some niche interpretation of statistics, provide a way to attach 
application-defined stats to parquet (this extends to a number of histogram 
types or counts of other "special" values). Keep the storage format baseline 
simple. IEEE total ordering is well defined and universally agreed upon. I 
think the world doesn't need yet another special floating point treatment.
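
   To make the two suggestions concrete, here is a minimal sketch in plain 
Python (no Parquet library involved). The helper names `nan_to_null`, 
`total_order_key` and `column_stats` are made up for illustration and are not 
part of any Parquet API. The sketch assumes 64-bit floats and uses `None` as 
the stand-in for NULL.

```python
import math
import struct


def nan_to_null(values):
    """Hypothetical helper: map NaN to None (NULL) before writing."""
    return [None if (v is not None and math.isnan(v)) else v for v in values]


def total_order_key(x):
    """Sort key following the IEEE 754 totalOrder predicate for 64-bit floats.

    Reinterpret the float as a signed 64-bit integer; for negative floats,
    flip the 63 magnitude bits so that larger magnitudes sort first.
    """
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


def column_stats(values):
    """Hypothetical min/max/null_count over the non-NULL values."""
    present = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(present),
        "min": min(present, key=total_order_key) if present else None,
        "max": max(present, key=total_order_key) if present else None,
    }


raw = [1.5, float("nan"), -2.0, None, float("inf")]
cleaned = nan_to_null(raw)      # [1.5, None, -2.0, None, inf]
stats = column_stats(cleaned)   # {'null_count': 2, 'min': -2.0, 'max': inf}
```

   With the NaNs cast to NULL, the ordinary `null_count` already accounts for 
them and min/max need no special case. If NaNs are kept instead, 
`total_order_key` still gives a deterministic order: -0.0 sorts before +0.0, 
and a NaN with a positive sign bit sorts after +Inf.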


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
