[jira] [Commented] (PARQUET-2255) BloomFilter and float point is ambiguous

Gabor Szadovszky (Jira) Mon, 13 Mar 2023 09:37:21 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699732#comment-17699732
 ]


Gabor Szadovszky commented on PARQUET-2255:
-------------------------------------------

But we build the dictionary for encoding, not for filtering. We should not 
add anything to it beyond what we have in the pages. So any special handling 
should be added to the read path.

Maybe we do not need to handle +0.0 and -0.0 differently from other values. 
(We had to handle them separately for min/max values because the comparison 
is not trivial and there were actual issues.) Anyone who deals with FP numbers 
should know about the difference between +0.0 and -0.0.
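
To illustrate the +0.0 / -0.0 difference, here is a minimal Java sketch. It 
assumes that a Bloom filter hashes the raw bytes of a value, so two values 
match only if their bit patterns match. The class and method names are 
hypothetical and not part of parquet-mr.

```java
public class SignedZeroDemo {
    // True when the two doubles have identical IEEE 754 bit patterns.
    // A filter that hashes raw bytes (an assumption here) would treat
    // two values as the same entry only in that case.
    public static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    public static void main(String[] args) {
        double pos = 0.0;
        double neg = -0.0;
        // Primitive comparison follows IEEE 754: the two zeros are equal.
        System.out.println(pos == neg);                       // true
        // Boxed equality compares bit patterns: the two zeros differ.
        System.out.println(Double.valueOf(pos).equals(neg));  // false
        // Raw bits differ only in the sign bit.
        System.out.println(sameBits(pos, neg));               // false
    }
}
```

So a lookup for 0.0 would not find a stored -0.0 under that assumption, and 
a user who cares about the distinction has to query both.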

Because the FP spec allows multiple NaN values (even though Java uses a 
single bit pattern for it), we need to avoid using the Bloom filter in this 
case. The dictionary is a different thing because we deserialize it to Java 
Double/Float values in a Set, so we will have one NaN value that is the very 
same one we are searching for. (It is more for the other implementations to 
deal with NaN if the language has several NaN values.)
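
A minimal Java sketch of the NaN point, under the same assumption that a 
Bloom filter hashes raw bytes. The class name, the helper, and the second NaN 
constant are hypothetical, chosen only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class NaNDemo {
    // A quiet NaN with a non-default payload; the constant is illustrative only.
    public static final double OTHER_NAN =
            Double.longBitsToDouble(0x7ff8000000000001L);

    // Number of distinct entries after boxing the values into a Set<Double>.
    public static int distinctBoxed(double... values) {
        Set<Double> set = new HashSet<>();
        for (double v : values) {
            set.add(v); // Double.equals/hashCode use the canonical NaN bits
        }
        return set.size();
    }

    public static void main(String[] args) {
        // Raw bits may differ between NaN values (platform permitting).
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(Double.NaN)));
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(OTHER_NAN)));
        // doubleToLongBits collapses every NaN to one canonical pattern.
        System.out.println(Long.toHexString(Double.doubleToLongBits(OTHER_NAN)));
        // A Set<Double> therefore holds a single NaN entry.
        System.out.println(distinctBoxed(Double.NaN, OTHER_NAN));
    }
}
```

The Set collapses both NaN values into one entry, which is why the dictionary 
lookup works, while a hash over the raw bytes written by another 
implementation could miss.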

> BloomFilter and float point is ambiguous
> ----------------------------------------
>
>                 Key: PARQUET-2255
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2255
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Xuwei Fu
>            Priority: Major
>             Fix For: format-2.9.0
>
>
> Currently, Parquet can use a BloomFilter for any physical type. However, 
> when a BloomFilter is applied to floats:
>  # What do +0 and -0 mean? Are they equal?
>  # Should qNaN and sNaN be written to the BloomFilter? Are they equal?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
