[jira] [Updated] (ARROW-15239) [C++][Compute] Introduce Bloom filters to hash join

ASF GitHub Bot (Jira) Mon, 03 Jan 2022 21:31:18 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated ARROW-15239:
-----------------------------------
    Labels: pull-request-available query-engine  (was: query-engine)

> [C++][Compute] Introduce Bloom filters to hash join
> ---------------------------------------------------
>
>                 Key: ARROW-15239
>                 URL: https://issues.apache.org/jira/browse/ARROW-15239
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 6.0.0
>            Reporter: Michal Nowakiewicz
>            Assignee: Michal Nowakiewicz
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Bloom filters are a common way to improve performance of hash joins where 
> many rows on the probe side of the hash join do not have matches on the build 
> side. Bloom filters are often able to reduce the cost of eliminating such 
> rows early in the processing pipeline, since they are cheaper to probe than 
> the hash join hash table, but they can return false positives for a 
> reasonably small percentage of inputs.
> This task is about introducing a data structure of register blocked Bloom 
> filter implementation (a practical modification of Bloom filter concept that 
> is specifically tuned for use in query processing related to hash joins and 
> both more space efficient and less costly than using hash table for 
> filtering). The data structure should provide functionality for parallel 
> construction from a vector of exec batches accumulated in memory and 
> vectorized lookup and filtering for a single exec batch. It should not have a 
> limit on the size of the Bloom filter (the number of inserted hashes), which 
> requires using 64-bit hashes for larger inputs. It should be verified that 
> build and probe costs are reasonable low and false positives rate is at most 
> few percent (which should be acceptable in use for query processing).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Updated] (ARROW-15239) [C++][Compute] Introduce Bloom filters to hash join

Reply via email to