nishantmonu51 opened a new pull request #6222: Add ability to pass in Bloom filter from Hive Queries
URL: https://github.com/apache/incubator-druid/pull/6222

This PR adds a BloomDimFilter, which Apache Hive can use to pass Bloom filters to Druid.

Use case: we have a fact table in Druid and slowly changing dimension/lookup tables in Apache Hive, and we need to join those tables. For example, consider the SSB benchmark with the lineorder table stored in Druid and the parts table in Hive, and the following query from the benchmark:

```sql
select sum(total_revenue)
from druid.ssb_lineorder_100, hive.ssb_lineorder_100
WHERE lo_partkey = p_partkey and p_category = 'MFGR#14';
```

For this query, Hive can scan the parts table and build a Bloom filter over the possible values of p_partkey where p_category = 'MFGR#14'. This Bloom filter can then be pushed to Druid, reducing the amount of data that needs to be scanned and transferred between Druid and Hive. Because a Bloom filter is a probabilistic data structure and can return false positives, Hive still needs to apply the exact filter while processing the join.
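To make the flow described above concrete, here is a minimal sketch in Python. It is not the BloomDimFilter implementation from this PR and does not use the Druid or Hive APIs: the `BloomFilter` class, its sizing parameters, and the sample rows are all hypothetical, chosen only to show why the pre-filter can safely run on the fact-table side while the exact check stays on the join side.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only): k hash positions over m bits."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, value):
        # Derive k positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(value))


# Dimension side (Hive in the PR's use case): hypothetical part keys that
# satisfy the dimension predicate, e.g. p_category = 'MFGR#14'.
matching_part_keys = {101, 205, 309}
bloom = BloomFilter()
for key in matching_part_keys:
    bloom.add(key)

# Fact side (Druid in the PR's use case): hypothetical (lo_partkey, revenue)
# rows. Only rows that pass the Bloom filter are shipped to the join.
fact_rows = [(101, 10.0), (102, 99.0), (205, 20.0), (400, 5.0), (309, 30.0)]
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]

# Join side: re-check with the exact key set, dropping any false positives.
joined = [row for row in candidates if row[0] in matching_part_keys]
total_revenue = sum(revenue for _, revenue in joined)
```

A Bloom filter never reports a key it has seen as absent, so the pre-filter cannot drop a row that belongs in the join; it can only let extra rows through. That is why the final exact check is still required, and why the result is unchanged no matter how many false positives the filter produces.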
