[ https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699732#comment-17699732 ]
Gabor Szadovszky commented on PARQUET-2255:
-------------------------------------------

But we build the dictionary for encoding, not for filtering. We should not add anything other than what we have in the pages. So anything extra should be added on the read path.

Maybe we do not need to handle +0.0 and -0.0 differently from the other values. (We needed to handle them separately for min/max values because the comparison is not trivial and there were actual issues.) If someone deals with FP numbers they should know about the difference between +0.0 and -0.0.

Because the FP spec allows multiple NaN values (even though Java uses one actual bit pattern for it), we need to avoid using the Bloom filter in this case. The dictionary is a different thing because we deserialize it to Java Double/Float values in a Set, so we will have one NaN value that is the very same one we are searching for. (It is more for the other implementations to deal with NaN if the language has several NaN values.)

> BloomFilter and float point is ambiguous
> ----------------------------------------
>
>                 Key: PARQUET-2255
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2255
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Xuwei Fu
>            Priority: Major
>             Fix For: format-2.9.0
>
>
> Currently, our Parquet can use BloomFilter for any physical types. However,
> when BloomFilter is applied to float:
> # What do +0 and -0 mean? Are they equal?
> # Should qNaN and sNaN be written to BloomFilter? Are they equal?
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
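
The distinction drawn in the comment between a Java Set of boxed values and a filter over raw bits can be checked with plain JDK calls. The following is a minimal sketch, not parquet-mr code: the class `FloatingPointKeys` and its helpers are hypothetical names, and the `HashSet<Double>` only stands in for a decoded dictionary, on the assumption that lookups go through `Double.equals` and `Double.hashCode`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; nothing here is part of the Parquet API.
class FloatingPointKeys {

    // Stand-in for a dictionary decoded into boxed Java values.
    static Set<Double> dictionaryOf(double... values) {
        Set<Double> set = new HashSet<>();
        for (double v : values) {
            set.add(v); // boxing goes through Double.valueOf
        }
        return set;
    }

    // Raw IEEE 754 bit pattern, i.e. what a hash over the stored bytes would see.
    static long rawBits(double v) {
        return Double.doubleToRawLongBits(v);
    }

    public static void main(String[] args) {
        // A second quiet NaN that differs from Double.NaN only in its payload bits.
        double otherNaN = Double.longBitsToDouble(0x7ff8000000000001L);

        // Primitive comparison follows IEEE 754.
        System.out.println(Double.NaN == otherNaN); // false
        System.out.println(0.0 == -0.0);            // true

        // Boxed comparison collapses all NaNs into one value,
        // but keeps +0.0 and -0.0 apart.
        System.out.println(dictionaryOf(Double.NaN).contains(otherNaN)); // true
        System.out.println(dictionaryOf(0.0).contains(-0.0));            // false

        // The raw bit patterns of the two NaNs differ, so a filter keyed on
        // raw bytes could miss one NaN when another was written.
        System.out.println(rawBits(Double.NaN) != rawBits(otherNaN)); // true
    }
}
```

This matches the reasoning above: a Set of `Double` finds NaN regardless of payload and treats the two zeros as distinct values, while anything keyed on the raw bits sees different NaN encodings as different keys.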