crepererum commented on issue #264: URL: https://github.com/apache/arrow-rs/issues/264#issuecomment-834160981
> If we really want to add NaN to the stats I think it would help to articulate an actual usecase where having a NaN there would be useful

The following come to mind:

- PostgreSQL-style SQL (which IIRC DataFusion follows) allows `X == NaN` to be queried. For that, "contains NaN" is an important piece of information.
- PostgreSQL-style SQL (again for DataFusion) sorts NaNs at the end of the spectrum (i.e. after +inf), so at least "contains NaN" would be useful.

However, "contains NaN" only makes sense for Parquet Float and Double, and the current Parquet standard statistics do NOT provide any way to express it. And since for most query cases one can argue that NaN should be placed somewhere in the total order, I would rather put that information into min/max instead of creating a convoluted edge case, i.e.

> Query processors rely on min/max to figure out the full value range. However, for Float and Double (and only for these) there is also this special flag that does not exist for any other data type and that tells you about the existence of that one value class.
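To illustrate what "putting NaN into min/max" could look like, here is a minimal sketch. It assumes the PostgreSQL-style order mentioned above (all NaNs compare equal to each other and sort after +inf). The names `pg_style_cmp`, `min_max_with_nan`, and `may_contain_nan` are hypothetical and are not part of arrow-rs or the parquet crate.

```rust
use std::cmp::Ordering;

/// Assumed PostgreSQL-style total order: every NaN equals every other NaN
/// and is greater than any non-NaN value, including +inf.
fn pg_style_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN here, so partial_cmp cannot return None.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Hypothetical helper: min/max of a column chunk under the order above.
/// Returns None for an empty chunk.
fn min_max_with_nan(values: &[f64]) -> Option<(f64, f64)> {
    let mut iter = values.iter().copied();
    let first = iter.next()?;
    Some(iter.fold((first, first), |(min, max), v| {
        let min = if pg_style_cmp(v, min) == Ordering::Less { v } else { min };
        let max = if pg_style_cmp(v, max) == Ordering::Greater { v } else { max };
        (min, max)
    }))
}

/// Hypothetical pruning check: under this order NaN is the largest value,
/// so a chunk contains a NaN exactly when its max is NaN.
fn may_contain_nan(min_max: (f64, f64)) -> bool {
    min_max.1.is_nan()
}
```

With this ordering, a reader that wants to answer `X == NaN` only has to look at the max statistic, and no extra Float/Double-only flag is needed.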
