[GitHub] [arrow-rs] crepererum commented on issue #264: Include NaN in Parquet stats (again)

GitBox Fri, 07 May 2021 01:14:23 -0700


crepererum commented on issue #264:
URL: https://github.com/apache/arrow-rs/issues/264#issuecomment-834160981



   > If we really want to add NaN to the stats I think it would help to 
articulate an actual usecase where having a NaN there would be useful
   
   The following come to my mind:
   
   - PostgreSQL-style SQL (which IIRC DataFusion follows) allows `X == NaN` to 
be queried. For that, "contains NaN" is important information.
   - PostgreSQL-style SQL (again relevant for DataFusion) sorts NaNs at the end 
of the value range (i.e. after +inf), so at least "contains NaN" would be useful.
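
   To make the ordering concrete, here is a minimal Rust sketch of a comparison in which all NaNs compare equal to each other and greater than every non-NaN value, which is how PostgreSQL documents its handling of NaN. The names `cmp_nan_last` and `sorted_nan_last` are made up for illustration and are not part of arrow-rs or DataFusion.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: compare two f64 values so that every NaN is equal to
/// every other NaN and greater than all non-NaN values (including +inf).
fn cmp_nan_last(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN here, so partial_cmp cannot return None.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Hypothetical helper: sort ascending with NaNs placed after +inf.
fn sorted_nan_last(mut values: Vec<f64>) -> Vec<f64> {
    values.sort_by(|a, b| cmp_nan_last(*a, *b));
    values
}
```

   Note that the standard library's `f64::total_cmp` is not a drop-in replacement here, because it places NaNs with the sign bit set before -inf.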
   
   However, "contains NaN" only makes sense for Parquet Float and Double, and 
the current Parquet standard statistics do NOT provide any way to express this. 
And since for most query cases one can argue that NaN should be placed 
somewhere in the total order, I would rather put that information into min/max 
instead of creating some convoluted edge case, i.e.
   
   > Query processors rely on min/max to figure out the full value range. 
However for Float and Double (and only for these) there is also this special 
flag that does not exist for any other data type that tells you about the 
existence of that one value class.
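
   The min/max alternative could be sketched as follows, assuming NaN is ordered after +inf. `MinMax`, `compute_min_max` and `may_contain_nan` are hypothetical names for this illustration and do not correspond to the parquet crate's statistics API. With this ordering, a chunk contains a NaN exactly when its max is NaN, so no separate flag is needed.

```rust
/// Hypothetical min/max statistics for one Float/Double column chunk.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: f64,
    max: f64,
}

/// `a < b` in an order where NaN is the greatest value and all NaNs are equal.
fn nan_last_lt(a: f64, b: f64) -> bool {
    match (a.is_nan(), b.is_nan()) {
        (true, _) => false,      // NaN is never less than anything
        (false, true) => true,   // any non-NaN is less than NaN
        (false, false) => a < b, // ordinary comparison
    }
}

/// Compute min/max with NaN included in the order; None for an empty chunk.
fn compute_min_max(values: &[f64]) -> Option<MinMax> {
    let (&first, rest) = values.split_first()?;
    let mut stats = MinMax { min: first, max: first };
    for &v in rest {
        if nan_last_lt(v, stats.min) {
            stats.min = v;
        }
        if nan_last_lt(stats.max, v) {
            stats.max = v;
        }
    }
    Some(stats)
}

/// A reader can skip a chunk for a query like `X == NaN` when this is false.
fn may_contain_nan(stats: &MinMax) -> bool {
    stats.max.is_nan()
}
```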


