[ 
https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-10112:
-----------------------------------
    Description: 
The FpRateTooHigh() check disables bloom filters on the sender side.

It is inaccurate when the build side contains duplicate values of the filter 
key, e.g. in a many-to-many join or a join with multiple keys. This could be 
fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false 
positives, because evaluation is cheap and eliminating a partition is still 
beneficial.
* Runtime filters are dynamically disabled on the scan side if they are 
ineffective. I think we also still "evaluate" the always-true filter, which is 
cheaper than doing the hashing and bloom evaluation, but still not entirely 
free.
* The disabling is fairly unlikely to kick in for partitioned joins because 
it's only applied to a small subset of the filter, before the Or() operation.

So the check is potentially harmful and likely beneficial only for broadcast 
join filters, in which case it saves a small amount of scan CPU and, for global 
filters, coordinator RPCs and broadcasting. It's unclear that the complexity is 
worth it for this relatively small and uncertain benefit.
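
To illustrate the duplicate-key inaccuracy described above, here is a minimal 
sketch in Python. It assumes the check estimates the false-positive rate from 
the number of rows inserted rather than the number of distinct keys, and it 
uses the classic bloom filter estimate (1 - e^(-k*n/m))^k. The function names, 
filter size, hash count and threshold are all hypothetical and are not taken 
from the Impala source.

```python
import math


def estimated_fp_rate(num_inserted: int, num_bits: int, num_hashes: int) -> float:
    """Classic bloom filter false-positive estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_inserted / num_bits)) ** num_hashes


def fp_rate_too_high(num_inserted: int, num_bits: int, num_hashes: int,
                     max_fp_rate: float) -> bool:
    """Hypothetical stand-in for the sender-side check discussed in this issue."""
    return estimated_fp_rate(num_inserted, num_bits, num_hashes) > max_fp_rate


# Illustrative parameters only (assumptions, not Impala defaults).
NUM_BITS = 8 * 1024 * 1024   # a 1 MiB filter
NUM_HASHES = 8
MAX_FP_RATE = 0.75

# Build side of a many-to-many join: each key is repeated about 100 times.
distinct_keys = 100_000
build_rows = 10_000_000

# Estimating from the row count treats every duplicate as a new key.
rate_from_rows = estimated_fp_rate(build_rows, NUM_BITS, NUM_HASHES)
# Duplicates set the same bits, so the real occupancy follows the distinct count.
rate_from_distinct = estimated_fp_rate(distinct_keys, NUM_BITS, NUM_HASHES)

disabled_by_row_count = fp_rate_too_high(build_rows, NUM_BITS, NUM_HASHES, MAX_FP_RATE)
disabled_by_distinct = fp_rate_too_high(distinct_keys, NUM_BITS, NUM_HASHES, MAX_FP_RATE)

print(f"estimate from row count:      {rate_from_rows:.6f}")
print(f"estimate from distinct count: {rate_from_distinct:.3e}")
print(f"disabled using row count:      {disabled_by_row_count}")
print(f"disabled using distinct count: {disabled_by_distinct}")
```

Under these assumed numbers, the row-count estimate trips the threshold and 
the filter would be disabled, even though the distinct-key estimate indicates 
that the filter is still highly selective.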



  was:
This check disables bloom filters on the sender side.

It is inaccurate in cases where there are duplicate values of the filter key on 
the build side. E.g. many-to-many join or a join with multiple keys. This could 
be fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false 
positives, because it's cheap and eliminating a partition is still beneficial.
* Runtime filters are dynamically disabled on the scan side if they are 
ineffective.
* The disabling is fairly unlikely to kick in for partitioned joins because 
it's only applied to a small subset of the filter, before the Or() operation.

So it's potentially harmful and only likely beneficial for broadcast join 
filters, in which case it saves a small amount of scan CPU and, for global 
filters, coordinator RPCs and broadcasting. It's unclear that the complexity is 
worth it for this relatively small and uncertain benefit.




> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
>                 Key: IMPALA-10112
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10112
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: performance


