westonpace commented on PR #196: URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625662823
> CON: NaNs will be used in min/max bounds, even for pages that are not only-NaN. This makes them less effective for filtering (as they are the widest possible bounds) but @crepererum made a good point that this "special case for NaN" is quite arbitrary and we could also special case, e.g., INT_MAX for integer columns. I see the argument that keeping the architecture simple might be preferable. Also NaNs are not widely used, so this will not be detrimental to many data sets.

I agree this is a con. Total ordering is nice if the goal is to order the data. If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless.

However, I also agree with @crepererum that this is a slippery slope and I agree with @JFinis that NaNs are not widely used and simpler is better. I don't entirely agree the solution can always be to replace NaN/Infinity with NULL but the cases where it can't are probably very rare. Besides, the penalty here is only a performance loss and not incorrect results, so it's more manageable.

So, on balance, I'd say I'm neutral. If there are other advantages to this approach then the disadvantages to dataset filtering are probably not enough to outweigh them. We might want to add a small sentence to some kind of pyarrow or parquet documentation somewhere so that users can be aware of this.
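
To illustrate the "performance loss, not incorrect results" point, here is a minimal sketch of min/max page pruning. It assumes NaN sorts above every other value, so a page containing a NaN ends up with a NaN max bound. The page layout and the `may_contain_greater_than` helper are hypothetical and are not taken from the Parquet spec or from pyarrow.

```python
import math

def may_contain_greater_than(page_min, page_max, threshold):
    """Decide from min/max statistics alone whether a page might hold a value > threshold.

    Hypothetical helper: a NaN bound is treated as the widest possible bound,
    so the page can never be skipped.
    """
    if math.isnan(page_min) or math.isnan(page_max):
        return True
    return page_max > threshold

# Made-up pages; the second one holds a NaN, so its max bound is NaN.
pages = [
    {"values": [1.0, 2.0, 3.0], "min": 1.0, "max": 3.0},
    {"values": [4.0, float("nan"), 5.0], "min": 4.0, "max": float("nan")},
    {"values": [10.0, 12.0], "min": 10.0, "max": 12.0},
]

threshold = 8.0

# Pruning step: keep only the pages that the statistics cannot rule out.
scanned = [p for p in pages
           if may_contain_greater_than(p["min"], p["max"], threshold)]

# Row-level filter on the surviving pages; NaN > threshold is False.
matches = [v for p in scanned for v in p["values"] if v > threshold]

print(len(scanned), matches)
```

The second page has to be read even though none of its values match, which is the extra cost. The final result is still correct because the row-level filter discards the non-matching values.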
