Michal Nowakiewicz created ARROW-15239:
------------------------------------------

             Summary: [C++][Compute] Introduce Bloom filters to hash join
                 Key: ARROW-15239
                 URL: https://issues.apache.org/jira/browse/ARROW-15239
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 6.0.0
            Reporter: Michal Nowakiewicz
            Assignee: Michal Nowakiewicz
             Fix For: 7.0.0


Bloom filters are a common way to improve performance of hash joins where many 
rows on the probe side of the hash join do not have matches on the build side. 
Bloom filters are often able to reduce the cost of eliminating such rows early 
in the processing pipeline, since they are cheaper to probe than the hash join 
hash table, but they can return false positives for a reasonably small 
percentage of inputs.

This task is about introducing a data structure of register blocked Bloom 
filter implementation (a practical modification of Bloom filter concept that is 
specifically tuned for use in query processing related to hash joins and both 
more space efficient and less costly than using hash table for filtering). The 
data structure should provide functionality for parallel construction from a 
vector of exec batches accumulated in memory and vectorized lookup and 
filtering for a single exec batch. It should not have a limit on the size of 
the Bloom filter (the number of inserted hashes), which requires using 64-bit 
hashes for larger inputs. It should be verified that build and probe costs are 
reasonable low and false positives rate is at most few percent (which should be 
acceptable in use for query processing).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to