[ https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-10112:
-----------------------------------
Description:
This check disables bloom filters on the sender side.
It is inaccurate when there are duplicate values of the filter key on the
build side, for example in a many-to-many join or a join with multiple keys.
This could be fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false
positives, because evaluating them is cheap and eliminating a partition is
still beneficial.
* Runtime filters are dynamically disabled on the scan side if they are
ineffective. I think we still also "evaluate" the always true filter, which is
cheaper than doing the hashing and bloom evaluation, but still not entirely
free.
* The disabling is fairly unlikely to kick in for partitioned joins because
it's only applied to a small subset of the filter, before the Or() operation.
So the check is potentially harmful and is only likely to be beneficial for
broadcast join filters, in which case it saves a small amount of scan CPU and,
for global filters, coordinator RPCs and broadcasting. It's unclear that the
complexity is worth it for this relatively small and uncertain benefit.
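
To make the two failure modes concrete, here is a rough sketch in Python. It
is not Impala's code: it uses the textbook Bloom filter estimate
(1 - e^(-k*n/m))^k rather than Impala's own calculation, and the function
names, the 0.75 threshold, the filter size, and the key counts are all
assumptions chosen for illustration. It shows that counting build rows instead
of distinct keys can trip the check on a nearly empty filter, and that a
per-fragment check can pass even when the merged filter would not.

```python
import math


def estimated_fp_rate(num_bits: int, num_hashes: int, num_inserts: int) -> float:
    """Textbook Bloom filter false-positive estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_inserts / num_bits)) ** num_hashes


def fp_rate_too_high(num_bits: int, num_hashes: int, num_inserts: int,
                     max_fp_rate: float = 0.75) -> bool:
    """Hypothetical sender-side check: disable the filter above a threshold."""
    return estimated_fp_rate(num_bits, num_hashes, num_inserts) > max_fp_rate


# Assumed filter shape: 1 MiB of bits, 8 hash probes per key.
NUM_BITS = 8 * 1024 * 1024
NUM_HASHES = 8

# Case 1: duplicates on the build side.
# 100,000 distinct keys, each appearing 100 times (10,000,000 build rows).
distinct_keys = 100_000
build_rows = distinct_keys * 100

# The filter only holds 100,000 distinct keys, so it is still selective...
dup_check_on_distinct = fp_rate_too_high(NUM_BITS, NUM_HASHES, distinct_keys)
# ...but an estimate driven by the row count decides it is useless.
dup_check_on_rows = fp_rate_too_high(NUM_BITS, NUM_HASHES, build_rows)

# Case 2: partitioned join.
# 4,000,000 distinct keys split evenly over 20 fragments. Each fragment checks
# only its own partial filter, before the partial filters are Or()'d together.
total_keys = 4_000_000
num_fragments = 20
keys_per_fragment = total_keys // num_fragments

per_fragment_check = fp_rate_too_high(NUM_BITS, NUM_HASHES, keys_per_fragment)
merged_filter_check = fp_rate_too_high(NUM_BITS, NUM_HASHES, total_keys)

if __name__ == "__main__":
    print("duplicates, distinct-key estimate too high:", dup_check_on_distinct)
    print("duplicates, row-count estimate too high:   ", dup_check_on_rows)
    print("partitioned, per-fragment check too high:  ", per_fragment_check)
    print("partitioned, merged filter too high:       ", merged_filter_check)
```

Under these assumed numbers the check fires where the filter is still useful
(case 1) and stays silent where the merged filter is saturated (case 2), which
is the pattern the bullets above describe.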
was:
This check disables bloom filters on the sender side.
It is inaccurate in cases where there are duplicate values of the filter key on
the build side. E.g. many-to-many join or a join with multiple keys. This could
be fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false
positives, because it's cheap and eliminating a partition is still beneficial.
* Runtime filters are dynamically disabled on the scan side if they are
ineffective.
* The disabling is fairly unlikely to kick in for partitioned joins because
it's only applied to a small subset of the filter, before the Or() operation.
So it's potentially harmful and only likely beneficial for broadcast join
filters, in which case it saves a small amount of scan CPU and, for global
filters, coordinator RPCs and broadcasting. It's unclear that the complexity is
worth it for this relatively small and uncertain benefit.
> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
> Key: IMPALA-10112
> URL: https://issues.apache.org/jira/browse/IMPALA-10112
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: performance