[ 
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767601#comment-17767601
 ] 

Riza Suminto edited comment on IMPALA-12455 at 9/21/23 3:00 PM:
----------------------------------------------------------------

I suspect this strategy can cost more in terms of memory space?

I hack bloom-filter-test to get some number using TPCD-DS Q95 as example.
{code:java}
  // TPC-DS 3TB has NDV(ws_order_number) = 176332106
  int ws_order_number_min_size = BloomFilter::MinLogSpace(176332106, 0.75);
  EXPECT_EQ(26, ws_order_number_min_size);

  // Assume we have 512 num_instances
  int ws_order_number_min_size_div_512 = BloomFilter::MinLogSpace(344398, 0.75);
  EXPECT_EQ(17, ws_order_number_min_size_div_512); {code}
In this case, having 1 filter of size 2^26 (64MB) will cost the same memory as 
having 512 filters of size 2^17 (128KB).
But Impala has RUNTIME_FILTER_MAX_SIZE of 2^24 (16MB), so it will cost 2^4 more 
memory space per consumer to have 512 filters of size 2^17 instead of 1 filter 
of 2^24 (capped at RUNTIME_FILTER_MAX_SIZE)?


was (Author: rizaon):
I suspect this strategy can cost more in terms of memory space?

I have bloom-filter-test to get some number using TPCD-DS Q95 as example.
{code:java}
  // TPC-DS 3TB has NDV(ws_order_number) = 176332106
  int ws_order_number_min_size = BloomFilter::MinLogSpace(176332106, 0.75);
  EXPECT_EQ(26, ws_order_number_min_size);

  // Assume we have 512 num_instances
  int ws_order_number_min_size_div_512 = BloomFilter::MinLogSpace(344398, 0.75);
  EXPECT_EQ(17, ws_order_number_min_size_div_512); {code}
In this case, having 1 filter of size 2^26 (64MB) will cost the same memory as 
having 512 filters of size 2^17 (128KB).
But Impala has RUNTIME_FILTER_MAX_SIZE of 2^24 (16MB), so it will cost 2^4 more 
memory space per consumer to have 512 filters of size 2^17 instead of 1 filter 
of 2^24 (capped at RUNTIME_FILTER_MAX_SIZE)?

> Create set of disjunct bloom filters for keys in partitioned builds
> -------------------------------------------------------------------
>
>                 Key: IMPALA-12455
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12455
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: performance
>
> Currently Impala aggregates bloom filters from different instances of the 
> join builder by OR-ing them to a final filter. This could be avoided by 
> having num_instances smaller bloom filters and choosing the correct one 
> during lookup by doing the same hashing as used in partitioning. Builders 
> would only need to write a single small filter as they have only keys from a 
> single partition. This would make runtime filter producers faster and much 
> more scalable while shouldn't have major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so 
> this optimization wouldn't be applicable in filters consumed by Kudu scans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to