alamb commented on issue #264: URL: https://github.com/apache/arrow-rs/issues/264#issuecomment-834682151
I am coming at this from a query processing point of view, where statistics are used to rule out entire row groups (or equivalents).

Having a clearly defined sort order for floats is important for sorting, but for statistics I feel like including `NaN` values in a column effectively "poisons" the effective use of those statistics.

I'll copy / paste my example from https://github.com/influxdata/influxdb_iox/pull/1448#discussion_r628230059. Consider the case where your data looks like this:

```
f
---
1.1
2.1
... (1 billion other values between 1.1 and 9.9)
NaN
9.9
```

If you have a predicate like `f > 10.0` (which does not evaluate to true for the one `NaN` row), the query engine will have to scan 1 billion extra rows due to the presence of a single `NaN` value.

Queries that are specifically looking for `NaN` are, I think, much less important to optimize.
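To make the pruning argument concrete, here is a minimal Rust sketch. It is not the arrow-rs or parquet implementation: the `Stats` struct, the `nan_greatest` ordering, and the `can_skip_gt` check are all hypothetical names invented for illustration. It assumes one possible total order in which `NaN` sorts above every number, and compares the resulting statistics with statistics computed over non-`NaN` values only.

```rust
use std::cmp::Ordering;

/// Hypothetical min/max statistics for one row group (illustration only).
#[derive(Debug, Clone, Copy)]
struct Stats {
    min: f64,
    max: f64,
}

/// Assumed total order for this sketch: NaN compares greater than any number.
fn nan_greatest(a: &f64, b: &f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp always returns Some.
        (false, false) => a.partial_cmp(b).unwrap(),
    }
}

/// Statistics that include NaN under the total order above.
fn stats_nan_greatest(values: &[f64]) -> Option<Stats> {
    let min = values.iter().copied().min_by(nan_greatest)?;
    let max = values.iter().copied().max_by(nan_greatest)?;
    Some(Stats { min, max })
}

/// Statistics computed over the non-NaN values only.
fn stats_ignoring_nan(values: &[f64]) -> Option<Stats> {
    let mut it = values.iter().copied().filter(|v| !v.is_nan());
    let first = it.next()?;
    Some(it.fold(Stats { min: first, max: first }, |s, v| Stats {
        min: s.min.min(v),
        max: s.max.max(v),
    }))
}

/// Can the row group be skipped for the predicate `f > threshold`?
/// Skipping is safe only if the statistics prove that no value exceeds
/// the threshold. When `max` is NaN the comparison is false, so the
/// row group must be scanned.
fn can_skip_gt(stats: &Stats, threshold: f64) -> bool {
    stats.max <= threshold
}

fn main() {
    // Stand-in for the example column (without the billion extra values).
    let values = [1.1, 2.1, f64::NAN, 9.9];

    let with_nan = stats_nan_greatest(&values).unwrap();
    let without_nan = stats_ignoring_nan(&values).unwrap();

    println!("including NaN: min={} max={}", with_nan.min, with_nan.max);
    println!("ignoring NaN:  min={} max={}", without_nan.min, without_nan.max);

    // The single NaN becomes the max, so `f > 10.0` cannot be pruned.
    println!("skip (including NaN): {}", can_skip_gt(&with_nan, 10.0));
    // With NaN excluded, max is 9.9 and the row group can be skipped.
    println!("skip (ignoring NaN):  {}", can_skip_gt(&without_nan, 10.0));
}
```

In this sketch the `NaN`-inclusive statistics carry no usable upper bound, which is the "poisoning" effect described above; the `NaN`-excluding variant keeps the bound at the cost of making `NaN`-specific queries unprunable from min/max alone.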
