JFinis commented on PR #196: URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1587083232
I finally have time to continue on this. Sorry for the long wait.

As @gszadovszky has highlighted, we have to store a valid double/float value in the min/max bounds of the column index to stay compatible with legacy readers. So the initial proposal to write NaN into min/max for a page that contains only NaNs would actually work. However, not everyone was happy with readers relying on these NaNs to detect an only-NaN page. Therefore, the suggestion was to also add `nan_pages` to the column options (favored by @wgtmac and @mapleFU).

I have updated the PR to follow this suggestion: we would still write NaNs into min/max in the column index if a page has only NaNs, but we advise readers not to use these values (as readers are already advised today) and instead to use only `nan_pages` to check for only-NaN pages. This way, we don't need to worry about the semantics of NaN comparisons, and readers can continue to ignore all NaN values they find in bounds.

I have not updated the PR description yet to reflect this new design; only the files themselves have been updated.

@wgtmac @mapleFU @gszadovszky Please review and let me know if you agree with this design. Then I will update the PR description accordingly.
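
For illustration only, here is a minimal Python sketch of the reader-side behavior described above. It is not taken from the PR or from any Parquet implementation: `ColumnIndexSketch`, `usable_bounds`, and `page_may_match_range` are hypothetical names, and the sketch assumes `nan_pages` holds one boolean per page that is true when the page contains only NaNs. The reader never compares against NaN bounds; it consults `nan_pages` to skip only-NaN pages and treats any NaN found in min/max as "no usable bound".

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ColumnIndexSketch:
    """Hypothetical in-memory view of a column index for a float/double column."""
    min_values: List[float]
    max_values: List[float]
    nan_pages: List[bool]  # assumption: True if the page contains only NaN values


def usable_bounds(index: ColumnIndexSketch, page: int) -> Optional[Tuple[float, float]]:
    """Return (min, max) for a page, or None if either bound is NaN."""
    lo, hi = index.min_values[page], index.max_values[page]
    if math.isnan(lo) or math.isnan(hi):
        return None  # NaN bounds are ignored, never compared
    return lo, hi


def page_may_match_range(index: ColumnIndexSketch, page: int,
                         lower: float, upper: float) -> bool:
    """Could the page hold a non-NaN value v with lower <= v <= upper?"""
    if index.nan_pages[page]:
        # Only-NaN page: no value can satisfy a range predicate.
        return False
    bounds = usable_bounds(index, page)
    if bounds is None:
        # No trustworthy bounds: be conservative and read the page.
        return True
    lo, hi = bounds
    return lo <= upper and hi >= lower


nan = float("nan")
index = ColumnIndexSketch(
    min_values=[1.0, nan, nan],
    max_values=[5.0, nan, 9.0],
    nan_pages=[False, True, False],
)
pages_to_read = [p for p in range(3) if page_may_match_range(index, p, 4.0, 10.0)]
```

With the sample index, page 1 is skipped purely because of `nan_pages`, while page 2 (a NaN in its lower bound but not marked as only-NaN) is still read, because its bounds are treated as unusable rather than compared.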
