[GitHub] [parquet-format] westonpace commented on pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Fri, 07 Jul 2023 09:32:24 -0700


westonpace commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625662823


   > CON: NaNs will be used in min/max bounds, even for pages that are not 
NaN-only. This makes them less effective for filtering (as they are the widest 
possible bounds), but @crepererum made a good point that this "special case for 
NaN" is quite arbitrary and we could also special-case INT_MAX for integer 
columns, for example. I see the argument that keeping the architecture simple 
might be preferable. Also, NaNs are not widely used, so this will not be 
detrimental to many data sets.
   
   I agree this is a con.  Total ordering is nice if the goal is to order the 
data.  If the goal is to filter the data then I think any consideration of 
NaN/null/infinity is meaningless.
   
   However, I also agree with @crepererum that this is a slippery slope and I 
agree with @JFinis that NaNs are not widely used and simpler is better.  I 
don't entirely agree the solution can always be to replace NaN/Infinity with 
NULL but the cases where it can't are probably very rare.  Besides, the penalty 
here is only a performance loss and not incorrect results so it's more 
manageable.
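   
   As a rough illustration of that penalty, here is a minimal sketch, not the 
Parquet implementation. It assumes a simplified total order in which every NaN 
sorts above +Infinity (ignoring the NaN sign bit and -0.0/+0.0), and the helper 
names `total_order_key`, `page_stats`, and `may_contain_greater_than` are 
hypothetical:

```python
import math

def total_order_key(x):
    # Simplified total order (assumption): any NaN sorts above every number.
    return (1, 0.0) if math.isnan(x) else (0, x)

def page_stats(values):
    # Min/max bounds of a page computed under the total order above.
    return (min(values, key=total_order_key), max(values, key=total_order_key))

def may_contain_greater_than(stats, threshold):
    # A page can be skipped for "x > threshold" only if its max bound
    # does not exceed the threshold in the total order.
    _, hi = stats
    return total_order_key(hi) > total_order_key(threshold)

pages = [
    [1.0, 2.0, 3.0],           # no NaN: max bound is 3.0
    [1.0, float("nan"), 3.0],  # one NaN: max bound becomes NaN
]

# Which pages must be scanned for the predicate x > 100.0?
decisions = [may_contain_greater_than(page_stats(p), 100.0) for p in pages]

# Rows that actually match once a page is scanned (NaN > 100.0 is False).
matches = [[v for v in p if v > 100.0] for p in pages]
```

   In this sketch the first page is pruned, while the page containing a NaN has 
to be scanned even though none of its rows match. The filter result is the same 
either way; only the pruning is lost.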
   
   So, on balance, I'd say I'm neutral.  If there are other advantages to 
this approach then the disadvantages to dataset filtering are probably not 
enough to outweigh them.  We might want to add a short sentence to the pyarrow 
or Parquet documentation so that users are aware of this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
