yabola commented on PR #1023: URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458436214
> Thanks @yabola for coming up with this idea. Let's continue the discussion about the BloomFilter building idea in the jira.
>
> Meanwhile, I've been thinking about the actual problem as well. Currently, for row group filtering we are checking the min/max values first which is correct since this is the most fast thing to do. Then the dictionary and then the bloom filter. The ordering of the latter two is not obvious to me in every scenarios. At the time of filtering we did not start reading the actual row group so there is no advantage in I/O to read the dictionary first. Furthermore, searching something in the bloom filter is much faster than in the dictionary. And the size of the bloom filter is probably less than the size of the dictionary. Though, it would require some measurements to prove if it is a good idea to get the bloom filter before the dictionary. What do you think?

Adjusting the filter order so that the cheaper filters are evaluated first is a good idea. But I have a concern (not sure whether it is correct):

In Parquet, a dictionary is only generated for low-cardinality data (see `parquet.dictionary.page.size`, 1 MB), whereas a BloomFilter is usually used on high-cardinality columns(?). So in practice only one of the two will normally be present(?), and ideally we should evaluate only one of them rather than both.

If there is a BloomFilter and the predicate is `=` or `in`, use only the BloomFilter; otherwise use the dictionary.
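
To make the proposed rule concrete, here is a minimal sketch of the decision only. All names (`RowGroupFilterChoice`, `Op`, `Filter`, `choose`) are hypothetical and are not parquet-mr API; it assumes the min/max statistics check has already run and could not drop the row group, and that `hasDictionary` means the column chunk has a dictionary usable for filtering.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration of the rule above; not parquet-mr code.
final class RowGroupFilterChoice {
    // Simplified predicate operators, for illustration only.
    enum Op { EQ, IN, NOT_EQ, LT, GT }

    // Which secondary filter to consult after the min/max statistics check.
    enum Filter { BLOOM_FILTER, DICTIONARY, NONE }

    // A bloom filter can only answer membership questions, i.e. `=` and `in`.
    private static final Set<Op> BLOOM_FRIENDLY = EnumSet.of(Op.EQ, Op.IN);

    static Filter choose(boolean hasBloomFilter, boolean hasDictionary, Op op) {
        if (hasBloomFilter && BLOOM_FRIENDLY.contains(op)) {
            // Prefer the bloom filter and skip the dictionary entirely.
            return Filter.BLOOM_FILTER;
        }
        if (hasDictionary) {
            // Fall back to the dictionary for every other case.
            return Filter.DICTIONARY;
        }
        return Filter.NONE;
    }
}
```

With this rule, a column chunk that has both structures never pays for both lookups: `choose(true, true, Op.EQ)` picks the bloom filter, while `choose(true, true, Op.LT)` falls back to the dictionary.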