[jira] [Commented] (IMPALA-10112) Consider skipping FpRateTooHigh() check for bloom filters

Shant Hovsepian (Jira) Mon, 21 Sep 2020 17:39:46 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199717#comment-17199717
 ]


Shant Hovsepian commented on IMPALA-10112:
------------------------------------------

Hi [~tarmstrong] [~rizaon], I just noticed this. Given that this is for 
partitioned hash joins, I'd say the code as written is probably doing more harm 
than good. It's incorrect for most practical cases with multiple columns in the 
join key, it assumes a uniform distribution across the build partitions, and it 
applies no correction for build_rows vs NDV. Also, by this point the CPU cost 
of computing the bloom filters has already been paid. We could probably do a 
better check at plan time if we had some stats.

In the case where the filter size is very large, and the coordinator overhead 
would be highest, I don't know how readily this check will kick in without 
adjusting build_rows for the NDV. So I imagine it rarely helps in the case it 
is meant for, especially with our current FPP at 0.75.
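
To make the build_rows vs NDV point concrete, here is a rough sketch. It uses 
the classical Bloom filter estimate (1 - e^(-k*n/m))^k, not Impala's actual 
FpRateTooHigh() code. The function names, filter size, hash count and row 
counts are all made up for illustration, and 0.75 is used as the cutoff on the 
assumption that this is what the FPP figure above refers to.

```python
import math


def estimated_fpp(num_inserted: int, filter_bits: int, num_hashes: int = 8) -> float:
    """Classical Bloom filter false-positive estimate: (1 - e^(-k*n/m))^k.

    Hypothetical helper for illustration, not Impala's implementation.
    """
    return (1.0 - math.exp(-num_hashes * num_inserted / filter_bits)) ** num_hashes


def fp_rate_too_high(num_inserted: int, filter_bits: int,
                     max_fpp: float = 0.75, num_hashes: int = 8) -> bool:
    """Return True if the estimated FPP exceeds the cutoff (0.75 is assumed)."""
    return estimated_fpp(num_inserted, filter_bits, num_hashes) > max_fpp


# Made-up numbers: a 1 MiB filter and a build side with heavy key duplication.
FILTER_BITS = 8 * 1024 * 1024
BUILD_ROWS = 10_000_000   # rows inserted on the build side
NDV = 100_000             # distinct join-key values among those rows

# Duplicate keys set the same bits again, so only distinct values fill the filter.
print("estimate from build_rows:", estimated_fpp(BUILD_ROWS, FILTER_BITS))
print("estimate from NDV:       ", estimated_fpp(NDV, FILTER_BITS))
print("disabled if using build_rows:", fp_rate_too_high(BUILD_ROWS, FILTER_BITS))
print("disabled if using NDV:       ", fp_rate_too_high(NDV, FILTER_BITS))
```

With these inputs the row-count estimate trips the cutoff while the NDV 
estimate stays far below it, so the outcome of the check depends heavily on 
which count is plugged in.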

I'm good with removing it. I think the scan-side check provides the bulk of 
the performance benefit. If anything, we might be able to be smarter at plan 
time and disable these filters in the couple of cases where it makes sense.

> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
>                 Key: IMPALA-10112
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10112
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: performance
>
> This check disables bloom filters on the sender side.
> It is inaccurate in cases where there are duplicate values of the filter key 
> on the build side. E.g. many-to-many join or a join with multiple keys. This 
> could be fixed with some effort, but is probably not worth it, because:
> * Partition filters are probably still worth evaluating even if there are 
> false positives, because it's cheap and eliminating a partition is still 
> beneficial.
> * Runtime filters are dynamically disabled on the scan side if they are 
> ineffective. I think we still also "evaluate" the always true filter, which 
> is cheaper than doing the hashing and bloom evaluation, but still not 
> entirely free.
> * The disabling is fairly unlikely to kick in for partitioned joins because 
> it's only applied to a small subset of the filter, before the Or() operation.
> So it's potentially harmful and only likely beneficial for broadcast join 
> filters, in which case it saves a small amount of scan CPU and, for global 
> filters, coordinator RPCs and broadcasting. It's unclear that the complexity 
> is worth it for this relatively small and uncertain benefit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

