[
https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186026#comment-17186026
]
Tim Armstrong commented on IMPALA-10112:
----------------------------------------
[~rizaon] [~superdupershant] I wouldn't mind your input on my thinking here.
I'm thinking about just removing this check:
https://github.com/apache/impala/blob/6c8a3dfc339e43a8992af2ff3429ba5940a061ec/be/src/exec/partitioned-hash-join-builder.cc#L939
I'll explain my reasoning, and I hope you can poke holes in it if there are
any.
I've thought about it a bit and can't think of a scenario where this
FpRateTooHigh() check is really important. It could have a marginal benefit
sometimes, but I don't think the benefit is large enough that we would really
be worried about regressing queries by removing it.
The argument for removing it is that it has both a high false negative rate
(i.e. final filter is ineffective but it gets through the check) and can return
false positives (i.e. final filter would be useful but gets rejected by the
check).
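To make the false-positive case concrete, here is a rough Python sketch of the
duplicate-key scenario from the issue description. It assumes the check
estimates the rate from the number of build rows inserted, and it uses the
textbook Bloom filter approximation rather than Impala's actual code. The
filter size, the threshold, and the row counts are all made-up values.

```python
import math


def estimated_fp_rate(num_inserted: int, filter_bits: int, num_hashes: int = 8) -> float:
    """Textbook Bloom filter estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_inserted / filter_bits)) ** num_hashes


FILTER_BITS = 8 * 1024 * 1024  # hypothetical 1 MiB filter
MAX_FP_RATE = 0.75             # illustrative threshold, not taken from this thread

build_rows = 10_000_000        # rows inserted on the build side, duplicates included
distinct_keys = 100_000        # distinct filter-key values among those rows

# Estimate driven by the row count (what a duplicate-unaware check would see).
fp_from_rows = estimated_fp_rate(build_rows, FILTER_BITS)
# Estimate driven by the distinct keys (what actually occupies the filter).
fp_from_distinct = estimated_fp_rate(distinct_keys, FILTER_BITS)

rejected_by_row_count = fp_from_rows > MAX_FP_RATE
actually_too_high = fp_from_distinct > MAX_FP_RATE

print(f"estimate from rows:          {fp_from_rows:.6f}")
print(f"estimate from distinct keys: {fp_from_distinct:.3e}")
print(f"rejected: {rejected_by_row_count}, actually too high: {actually_too_high}")
```

With these numbers the row-count estimate exceeds the threshold, so the filter
would be rejected, even though the distinct keys barely fill it.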
> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
> Key: IMPALA-10112
> URL: https://issues.apache.org/jira/browse/IMPALA-10112
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: performance
>
> This check disables bloom filters on the sender side.
> It is inaccurate in cases where there are duplicate values of the filter key
> on the build side. E.g. many-to-many join or a join with multiple keys. This
> could be fixed with some effort, but is probably not worth it, because:
> * Partition filters are probably still worth evaluating even if there are
> false positives, because it's cheap and eliminating a partition is still
> beneficial.
> * Runtime filters are dynamically disabled on the scan side if they are
> ineffective. I think we still also "evaluate" the always true filter, which
> is cheaper than doing the hashing and bloom evaluation, but still not
> entirely free.
> * The disabling is fairly unlikely to kick in for partitioned joins because
> it's only applied to a small subset of the filter, before the Or() operation.
> So it's potentially harmful and only likely beneficial for broadcast join
> filters, in which case it saves a small amount of scan CPU and, for global
> filters, coordinator RPCs and broadcasting. It's unclear that the complexity
> is worth it for this relatively small and uncertain benefit.
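
The partitioned-join point above can be sketched the same way. This is only an
illustration under assumptions: each instance is taken to check its own partial
filter before the Or(), the keys are assumed disjoint across instances, the
formula is the textbook Bloom filter approximation, and the sizes, instance
count, and threshold are hypothetical.

```python
import math


def estimated_fp_rate(num_inserted: int, filter_bits: int, num_hashes: int = 8) -> float:
    """Textbook Bloom filter estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_inserted / filter_bits)) ** num_hashes


FILTER_BITS = 8 * 1024 * 1024  # hypothetical 1 MiB filter, same size on every instance
MAX_FP_RATE = 0.75             # illustrative threshold, not taken from this thread

num_instances = 64             # hypothetical number of join instances
keys_per_instance = 500_000    # distinct keys per instance, assumed disjoint

# Each instance only sees its own share of the keys when the check runs.
partial_fp = estimated_fp_rate(keys_per_instance, FILTER_BITS)

# OR-ing same-sized filters behaves like one filter holding the union of keys.
merged_fp = estimated_fp_rate(num_instances * keys_per_instance, FILTER_BITS)

partial_passes_check = partial_fp <= MAX_FP_RATE

print(f"per-instance estimate: {partial_fp:.3e} (passes check: {partial_passes_check})")
print(f"merged estimate:       {merged_fp:.6f}")
```

Here every partial filter passes the check, yet the merged filter is close to
saturated, which is the false-negative case for the check.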