[
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767601#comment-17767601
]
Riza Suminto edited comment on IMPALA-12455 at 9/21/23 3:00 PM:
----------------------------------------------------------------
I suspect this strategy can cost more in terms of memory space?
I hack bloom-filter-test to get some number using TPCD-DS Q95 as example.
{code:java}
// TPC-DS 3TB has NDV(ws_order_number) = 176332106
int ws_order_number_min_size = BloomFilter::MinLogSpace(176332106, 0.75);
EXPECT_EQ(26, ws_order_number_min_size);
// Assume we have 512 num_instances
int ws_order_number_min_size_div_512 = BloomFilter::MinLogSpace(344398, 0.75);
EXPECT_EQ(17, ws_order_number_min_size_div_512); {code}
In this case, having 1 filter of size 2^26 (64MB) will cost the same memory as
having 512 filters of size 2^17 (128KB).
But Impala has RUNTIME_FILTER_MAX_SIZE of 2^24 (16MB), so it will cost 2^4 more
memory space per consumer to have 512 filters of size 2^17 instead of 1 filter
of 2^24 (capped at RUNTIME_FILTER_MAX_SIZE)?
was (Author: rizaon):
I suspect this strategy can cost more in terms of memory space?
I have bloom-filter-test to get some number using TPCD-DS Q95 as example.
{code:java}
// TPC-DS 3TB has NDV(ws_order_number) = 176332106
int ws_order_number_min_size = BloomFilter::MinLogSpace(176332106, 0.75);
EXPECT_EQ(26, ws_order_number_min_size);
// Assume we have 512 num_instances
int ws_order_number_min_size_div_512 = BloomFilter::MinLogSpace(344398, 0.75);
EXPECT_EQ(17, ws_order_number_min_size_div_512); {code}
In this case, having 1 filter of size 2^26 (64MB) will cost the same memory as
having 512 filters of size 2^17 (128KB).
But Impala has RUNTIME_FILTER_MAX_SIZE of 2^24 (16MB), so it will cost 2^4 more
memory space per consumer to have 512 filters of size 2^17 instead of 1 filter
of 2^24 (capped at RUNTIME_FILTER_MAX_SIZE)?
> Create set of disjunct bloom filters for keys in partitioned builds
> -------------------------------------------------------------------
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Frontend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: performance
>
> Currently Impala aggregates bloom filters from different instances of the
> join builder by OR-ing them to a final filter. This could be avoided by
> having num_instances smaller bloom filters and choosing the correct one
> during lookup by doing the same hashing as used in partitioning. Builders
> would only need to write a single small filter as they have only keys from a
> single partition. This would make runtime filter producers faster and much
> more scalable while shouldn't have major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so
> this optimization wouldn't be applicable in filters consumed by Kudu scans.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]