[
https://issues.apache.org/jira/browse/ARROW-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-15239:
-----------------------------------
Labels: pull-request-available query-engine (was: query-engine)
> [C++][Compute] Introduce Bloom filters to hash join
> ---------------------------------------------------
>
> Key: ARROW-15239
> URL: https://issues.apache.org/jira/browse/ARROW-15239
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 6.0.0
> Reporter: Michal Nowakiewicz
> Assignee: Michal Nowakiewicz
> Priority: Major
> Labels: pull-request-available, query-engine
> Fix For: 7.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Bloom filters are a common way to improve performance of hash joins where
> many rows on the probe side of the hash join do not have matches on the build
> side. Bloom filters are often able to reduce the cost of eliminating such
> rows early in the processing pipeline, since they are cheaper to probe than
> the hash join hash table, but they can return false positives for a
> reasonably small percentage of inputs.
> This task is about introducing a data structure of register blocked Bloom
> filter implementation (a practical modification of Bloom filter concept that
> is specifically tuned for use in query processing related to hash joins and
> both more space efficient and less costly than using hash table for
> filtering). The data structure should provide functionality for parallel
> construction from a vector of exec batches accumulated in memory and
> vectorized lookup and filtering for a single exec batch. It should not have a
> limit on the size of the Bloom filter (the number of inserted hashes), which
> requires using 64-bit hashes for larger inputs. It should be verified that
> build and probe costs are reasonable low and false positives rate is at most
> few percent (which should be acceptable in use for query processing).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)