JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1587083232

   I finally have time to continue on this. Sorry for the long wait.
   
   As @gszadovszky has highlighted, we have to store a valid double/float value 
in the min/max bounds of the column index to be compatible with legacy 
readers. So the initial proposal to write NaN into min/max for a page that 
contains only NaNs would actually work.
   
   But so far, not everyone was happy with readers using these NaNs to detect 
an only-NaN page. Therefore, the suggestion was to also add 
`nan_pages` to the column options (favored by @wgtmac and @mapleFU). I have 
updated the PR to follow this suggestion: we would still write NaNs into min/max 
in the column index if a page has only NaNs, but advise readers not to use these 
values (as readers are already advised today) and instead to use only `nan_pages` 
to check for only-NaN pages. This way, we don't need to worry about the 
semantics of NaN comparisons, and readers can continue to ignore all NaN values 
they find in bounds.
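
To illustrate the reader side of this, here is a rough Python sketch of how page pruning could look under this design. It is not code from the PR: `ColumnIndexSketch` and `pages_to_read` are hypothetical names, and it assumes that `nan_pages` holds one boolean per page, like `null_pages`.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnIndexSketch:
    """Hypothetical in-memory stand-in for a float column index."""
    min_values: List[float]
    max_values: List[float]
    null_pages: List[bool]
    nan_pages: List[bool]  # assumption: one flag per page, True = only NaNs


def pages_to_read(index: ColumnIndexSketch, lo: float, hi: float) -> List[int]:
    """Return ordinals of pages that may hold a non-NaN value in [lo, hi]."""
    selected = []
    for i in range(len(index.min_values)):
        # Only-null and only-NaN pages are detected via the flags alone,
        # never by inspecting the bounds.
        if index.null_pages[i] or index.nan_pages[i]:
            continue
        pmin, pmax = index.min_values[i], index.max_values[i]
        # A NaN found in the bounds is ignored: the bounds are treated as
        # unusable, so the page cannot be pruned.
        if math.isnan(pmin) or math.isnan(pmax):
            selected.append(i)
            continue
        if pmax >= lo and pmin <= hi:
            selected.append(i)
    return selected
```

With this shape, a reader never compares against NaN: an only-NaN page is skipped because of its flag, and a NaN that shows up in the bounds of any other page only disables pruning for that page.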
   
   I have not updated the PR description yet to reflect this new design; only 
the files themselves have been updated. @wgtmac @mapleFU @gszadovszky Please 
review and let me know if you agree with this design. Then I will update the PR 
description accordingly.
   

