[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697455#comment-17697455
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------

wgtmac commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458274839

   > Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   > 
   > Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we check the min/max values first, which is correct 
since this is the fastest thing to do, then the dictionary, and then the bloom 
filter. The ordering of the latter two is not obvious to me in every scenario. 
At the time of filtering we have not started reading the actual row group, so 
there is no I/O advantage in reading the dictionary first. Furthermore, 
searching for a value in the bloom filter is much faster than in the 
dictionary, and the bloom filter is probably smaller than the dictionary. 
It would still require some measurements to prove whether it is a good idea 
to check the bloom filter before the dictionary. What do you think?
   
   What I did in production is to issue async I/Os in advance for the 
dictionaries (only if all data pages in that column chunk are 
dictionary-encoded and the dictionary is not big) and for the bloom filters of 
the selected row groups. The goal is to eliminate blocking I/O when pushing 
down the predicates. However, the Parquet spec only records the offset of each 
bloom filter, not its length. So I also added the length of each bloom filter 
to the key-value metadata section (probably a good reason to add it to the 
spec as well?)
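
The prefetching idea can be sketched roughly as follows. This is only an illustration, not the production code mentioned above and not parquet-mr API: `FilterPrefetcher`, `RangeReader`, and `Range` are hypothetical names, and each range's length is assumed to be known up front (for bloom filters, for example from a custom key-value metadata entry).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch: none of these types exist in parquet-mr.
class FilterPrefetcher {

    /** Hypothetical positional reader over an open Parquet file. */
    interface RangeReader {
        byte[] read(long offset, int length);
    }

    /** Byte range of one dictionary page or one bloom filter. */
    static final class Range {
        final String name;
        final long offset;
        final int length; // assumed known, e.g. from key-value metadata

        Range(String name, long offset, int length) {
            this.name = name;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Starts one asynchronous read per range and returns immediately.
     * Predicate evaluation can later call join() on the future it needs,
     * by which time the bytes have usually arrived.
     */
    static Map<String, CompletableFuture<byte[]>> prefetch(
            RangeReader reader, List<Range> ranges, ExecutorService pool) {
        Map<String, CompletableFuture<byte[]>> pending = new LinkedHashMap<>();
        for (Range r : ranges) {
            pending.put(r.name,
                CompletableFuture.supplyAsync(() -> reader.read(r.offset, r.length), pool));
        }
        return pending;
    }
}
```

Without a recorded length, a reader has to fetch and parse the bloom filter header before it knows how many bytes to request, which is why a single ranged read like the one above needs the length in advance.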




> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
>                 Key: PARQUET-2237
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2237
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Priority: Major
>
> If we can judge accurately from the min/max statistics, we don’t need to load 
> the dictionary from the filesystem and compare values one by one.
> Similarly, the BloomFilter needs to be loaded from the filesystem, which may 
> cost time and memory. If we can determine exactly whether the value exists 
> from the min/max or dictionary filters, then we can avoid using the 
> BloomFilter and improve performance.
> For example,
>  # When reading data greater than {{x1}} in the block, if the min/max 
> statistics are all greater than {{x1}}, then we don't need to read the 
> dictionary and compare one by one.
>  # If we already have page dictionaries and have compared one by one, we 
> don't need to read the BloomFilter and compare.
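
A minimal sketch of the short-circuit described above, assuming each filter can report a three-valued result instead of a boolean. The names `ExactMatchFilterSketch`, `Verdict`, `minMaxGreaterThan`, and `keepRowGroup` are made up for illustration and are not part of parquet-mr's `RowGroupFilter`.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: not the parquet-mr implementation.
class ExactMatchFilterSketch {

    /** Three-valued filter result; UNKNOWN means "might match". */
    enum Verdict { MUST_DROP, MUST_KEEP, UNKNOWN }

    /** Min/max check for the predicate "value > x" on a long column. */
    static Verdict minMaxGreaterThan(long min, long max, long x) {
        if (max <= x) return Verdict.MUST_DROP; // no value can exceed x
        if (min > x) return Verdict.MUST_KEEP;  // every value exceeds x
        return Verdict.UNKNOWN;                 // x lies inside [min, max)
    }

    /**
     * Evaluates filters from cheapest to most expensive (min/max, then
     * dictionary, then bloom filter) and stops at the first exact answer,
     * so the later filters never trigger their I/O.
     */
    static boolean keepRowGroup(List<Supplier<Verdict>> filters) {
        for (Supplier<Verdict> filter : filters) {
            Verdict v = filter.get();
            if (v != Verdict.UNKNOWN) {
                return v == Verdict.MUST_KEEP;
            }
        }
        return true; // nothing was conclusive, so keep the row group
    }
}
```

Passing the filters as lazy suppliers is what saves the work: the dictionary and bloom filter are loaded only when an earlier filter returns `UNKNOWN`.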



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
