[GitHub] [arrow-rs] alamb edited a comment on issue #264: Include NaN in Parquet stats (again)

GitBox Fri, 07 May 2021 11:43:43 -0700


alamb edited a comment on issue #264:
URL: https://github.com/apache/arrow-rs/issues/264#issuecomment-834682151



   I am coming at this from a query processing point of view, using 
statistics to rule out entire row groups (or equivalents). Having a clearly 
defined sort order for floats is important for sorting, but for statistics I 
feel like including `NaN` values in a column's statistics effectively 
"poisons" their usefulness.
   
   I'll copy / paste my example from 
https://github.com/influxdata/influxdb_iox/pull/1448#discussion_r628230059. 
Suppose your data looked like this:
   ```
   f
   ---
   1.1
   2.1
   ... (1 Billion other values between 1.1 and 9.9)
   NaN
   9.9
   ```
   
   If you have a predicate like `f > 10.0` (which does not evaluate to 
true for the one `NaN` row), the query engine will have to scan 1 billion extra 
rows due to the presence of a single `NaN` value.
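
To make the pruning argument concrete, here is a minimal sketch in Python. It is not how arrow-rs or any Parquet reader computes or consumes statistics; the helper names `max_nan_as_largest`, `max_ignoring_nan`, and `can_skip_gt` are hypothetical and exist only for this illustration. It assumes a row group can be skipped for `f > threshold` only when a usable max statistic is at or below the threshold.

```python
import math


def max_nan_as_largest(values):
    """Max statistic under an ordering where NaN sorts above every number.

    Hypothetical helper: returns None for an empty input.
    """
    if not values:
        return None
    if any(math.isnan(v) for v in values):
        return math.nan
    return max(values)


def max_ignoring_nan(values):
    """Max statistic over the non-NaN values only (None if there are none)."""
    non_nan = [v for v in values if not math.isnan(v)]
    return max(non_nan) if non_nan else None


def can_skip_gt(stat_max, threshold):
    """Return True if the statistic proves no row satisfies `value > threshold`."""
    if stat_max is None or math.isnan(stat_max):
        # No usable upper bound, so be conservative and scan the row group.
        return False
    return stat_max <= threshold


# A tiny stand-in for the example above (one NaN among ordinary values).
row_group = [1.1, 2.1, 5.5, math.nan, 9.9]

print(can_skip_gt(max_nan_as_largest(row_group), 10.0))  # False: must scan
print(can_skip_gt(max_ignoring_nan(row_group), 10.0))    # True: can skip
```

In both cases no row actually satisfies `f > 10.0`, since `NaN > 10.0` is false; only the statistic that ignores `NaN` lets the engine prove that without scanning.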
   
   I think queries that specifically look for `NaN` are much less 
important to optimize.
   
   I like the idea of getting some consensus on what to do with NaNs and 
statistics on the parquet mailing list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
