[
https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200198#comment-17200198
]
Riza Suminto commented on IMPALA-10112:
---------------------------------------
Hi [~superdupershant], that is a good point. I'm good with removing this check
as well.
Two weeks ago, [~drorke] and I ran some runtime filter tuning experiments on
TPC-DS at 10 TB scale, combining three changes: raising the min/max filter
size, lowering the target FPP, and passing all runtime filters through
(basically removing this check).
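
To show how the filter size limits and the target FPP interact, here is a
rough sketch based on the textbook Bloom filter sizing formula. It is an
illustration under assumptions, not Impala's sizing logic: the function name,
the power-of-two rounding, and the byte limits passed in are all hypothetical.

```python
import math

def classic_bloom_size_bytes(ndv, target_fpp, min_bytes, max_bytes):
    """Textbook Bloom filter size for `ndv` distinct keys, clamped to limits.

    Hypothetical helper for illustration; not an Impala API.
    """
    # Optimal bit count for a classic Bloom filter: m = -n * ln(p) / (ln 2)^2
    bits = -ndv * math.log(target_fpp) / (math.log(2) ** 2)
    raw_bytes = max(1, math.ceil(bits / 8))
    # Assumption: round up to the next power of two.
    size = 1 << (raw_bytes - 1).bit_length()
    # Clamp to the configured minimum and maximum filter size.
    return min(max(size, min_bytes), max_bytes)

# 1M distinct keys at 1% target FPP, limits of 64 KB and 16 MB (made-up values)
print(classic_bloom_size_bytes(1_000_000, 0.01, 1 << 16, 1 << 24))
```

Under this model, lowering the target FPP only helps when the maximum size
leaves room for the filter to grow, which is why the limits and the FPP are
tuned together.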
Among queries that make heavy use of runtime filters, we observed that most of
them achieved better or similar performance compared to the baseline (default
config).
In the baseline runs, I found that the filters replaced by ALWAYS_TRUE_FILTER
by this check were all small filters of <= 16 KB. So I think it is quite
harmless to let them propagate anyway and let the scanners disable them by
themselves if they turn out to be ineffective.
We did see some small regressions in a few queries that we were not
specifically targeting. We suspect they are due to lengthy query unregistration
activity rather than filter propagation.
Unfortunately, we cannot say conclusively, since our cluster reservation has
run out.
> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
> Key: IMPALA-10112
> URL: https://issues.apache.org/jira/browse/IMPALA-10112
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: performance
>
> This check disables bloom filters on the sender side.
> It is inaccurate in cases where there are duplicate values of the filter key
> on the build side. E.g. many-to-many join or a join with multiple keys. This
> could be fixed with some effort, but is probably not worth it, because:
> * Partition filters are probably still worth evaluating even if there are
> false positives, because it's cheap and eliminating a partition is still
> beneficial.
> * Runtime filters are dynamically disabled on the scan side if they are
> ineffective. I think we still also "evaluate" the always true filter, which
> is cheaper than doing the hashing and bloom evaluation, but still not
> entirely free.
> * The disabling is fairly unlikely to kick in for partitioned joins because
> it's only applied to a small subset of the filter, before the Or() operation.
> So it's potentially harmful and only likely beneficial for broadcast join
> filters, in which case it saves a small amount of scan CPU and, for global
> filters, coordinator RPCs and broadcasting. It's unclear that the complexity
> is worth it for this relatively small and uncertain benefit.
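
The inaccuracy with duplicate build-side keys described above can be sketched
with the classic Bloom filter false-positive approximation. This is a minimal
illustration, not Impala's FpRateTooHigh() implementation: the function names,
the number of hash functions, the 16 KB filter size, and the 0.75 threshold
are assumptions chosen for the example.

```python
import math

def estimated_fpp(num_inserts, num_bits, num_hashes):
    """Classic approximation of the false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_inserts / num_bits)) ** num_hashes

def fp_rate_too_high(num_inserts, num_bits, num_hashes, max_fpp):
    """Hypothetical sender-side check: disable the filter if the estimate is high."""
    return estimated_fpp(num_inserts, num_bits, num_hashes) > max_fpp

# Build side of a many-to-many join: 200,000 rows but only 1,000 distinct keys.
build_keys = [i % 1000 for i in range(200_000)]
num_bits = 16 * 1024 * 8   # a 16 KB filter (illustrative)
num_hashes = 8             # illustrative
max_fpp = 0.75             # illustrative threshold

# Counting every inserted row overstates how full the filter really is,
# because a duplicate key sets the same bits again.
by_rows = fp_rate_too_high(len(build_keys), num_bits, num_hashes, max_fpp)
by_distinct = fp_rate_too_high(len(set(build_keys)), num_bits, num_hashes, max_fpp)
print(by_rows, by_distinct)
```

With these numbers, an estimate driven by the row count would disable the
filter, while an estimate driven by the distinct key count would keep it.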