[GitHub] [incubator-druid] nishantmonu51 opened pull request #6222: Add ability to pass in Bloom filter from Hive Queries

GitHub Thu, 23 Aug 2018 06:16:13 -0700

This PR adds a BloomDimFilter, which Apache Hive can use to pass Bloom 
filters to Druid.


Use Case - 
We have a fact table in Druid and slowly changing dimension/lookup tables in 
Apache Hive, and need to join those tables. 
For example, consider the SSB benchmark with lineorder stored in Druid and the 
parts table in Hive, and the following SSB query: 
```sql
select sum(total_revenue) from druid.ssb_lineorder_100, hive.ssb_lineorder_100 
WHERE lo_partkey = p_partkey and p_category = 'MFGR#14';
```
In the above query, Hive can scan the parts table and create a Bloom filter 
over the possible values of p_partkey where p_category = 'MFGR#14'. This Bloom 
filter can then be pushed to Druid, reducing the data that needs to be scanned 
and transferred between Druid and Hive. 
Since a Bloom filter is a probabilistic data structure and can have false 
positives, Hive will still need to do filtering while processing joins. 
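
The flow described above can be illustrated with a minimal Python sketch. It is not the code in this PR and not the Hive or Druid Bloom filter implementation: the `BloomFilter` class, its sizing parameters, and the sample rows are all hypothetical, chosen only to show the filter being built on the dimension side, applied as a pre-filter on the fact side, and followed by an exact check that removes any false positives.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; not the Hive/Druid implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical sample data: (p_partkey, p_category) and (lo_partkey, revenue).
parts = [(1, "MFGR#14"), (2, "MFGR#12"), (3, "MFGR#14")]
lineorder = [(1, 100), (2, 200), (3, 300), (4, 400), (1, 50)]

# Dimension side (Hive in the use case): collect keys and build the filter.
matching_keys = {key for key, category in parts if category == "MFGR#14"}
bloom = BloomFilter()
for key in matching_keys:
    bloom.add(key)

# Fact side (Druid in the use case): keep only rows the filter may match.
candidates = [row for row in lineorder if bloom.might_contain(row[0])]

# Join side: an exact check discards any false positives that slipped through.
total_revenue = sum(rev for key, rev in candidates if key in matching_keys)
print(total_revenue)
```

The pre-filter can only shrink the candidate set and never drops a true match, so the final sum is the same as it would be without the filter. The exact check is still needed because `might_contain` can return True for keys that were never added.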

[ Full content available at: 
https://github.com/apache/incubator-druid/pull/6222 ]
This message was relayed via gitbox.apache.org for [email protected]
