[ 
https://issues.apache.org/jira/browse/IMPALA-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183482#comment-17183482
 ] 

Shant Hovsepian edited comment on IMPALA-10098 at 8/24/20, 5:31 PM:
--------------------------------------------------------------------

 [~tarmstrong] we have seen the full range of cardinalities. In TPC-DS, a 
common pattern is to find all "transactions not returned", which is often an 
ANTI JOIN or LEFT JOIN between two fact tables. The return rate is around 1% in 
this synthetic case, so at a 30TB scale factor the cardinality is close to 
100M. TPC-DS also has cases with item dimensions and NOT IN, which are on the 
order of hundreds to thousands of unique values.

If we knew more about the domain of the values, a common middle ground between 
sending concrete data values and Bloom filters would be to use a bitmap for 
primary keys, or value ranges from the min/max info in Parquet.
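
To make the bitmap idea concrete, here is a minimal sketch, assuming integer 
primary keys whose min/max bounds are known up front (for example from Parquet 
column statistics). The names ExclusionBitmap and anti_join_scan are 
hypothetical and are not part of Impala.

```python
class ExclusionBitmap:
    """Exact set of integer keys in [min_key, max_key], one bit per possible key.

    Hypothetical illustration only; not an Impala data structure.
    """

    def __init__(self, min_key, max_key):
        if max_key < min_key:
            raise ValueError("max_key must be >= min_key")
        self.min_key = min_key
        self.max_key = max_key
        # One bit per key in the inclusive range, rounded up to whole bytes.
        self.bits = bytearray((max_key - min_key) // 8 + 1)

    def add(self, key):
        if key < self.min_key or key > self.max_key:
            raise ValueError("key outside the declared min/max range")
        offset = key - self.min_key
        self.bits[offset >> 3] |= 1 << (offset & 7)

    def __contains__(self, key):
        # Keys outside the range were never added, so they are not excluded.
        if key < self.min_key or key > self.max_key:
            return False
        offset = key - self.min_key
        return bool(self.bits[offset >> 3] & (1 << (offset & 7)))


def anti_join_scan(rows, key_of, excluded):
    """Keep only rows whose key is NOT in the exclusion set."""
    return [row for row in rows if key_of(row) not in excluded]


# Usage: drop sales whose ticket number also appears in returns.
returned = ExclusionBitmap(min_key=1001, max_key=1006)
for ticket in (1002, 1005):
    returned.add(ticket)

sales = [(1001, "a"), (1002, "b"), (1003, "c"), (1005, "d"), (1006, "e")]
not_returned = anti_join_scan(sales, key_of=lambda row: row[0], excluded=returned)
```

Unlike a Bloom filter, the bitmap is exact, so a scan can safely drop every 
row it matches. Its size depends on the width of the key range rather than on 
the number of keys inserted, so it is only compact when the keys are dense 
within that range.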


> Runtime Filters for Set Exclusion or Complement
> -----------------------------------------------
>
>                 Key: IMPALA-10098
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10098
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Shant Hovsepian
>            Priority: Major
>              Labels: runtime-filters
>
> It would be beneficial to extend runtime filters to push set exclusion down 
> to scan nodes. This would be used to optimize NOT IN and EXCEPT style queries, 
> or more generally ANTI JOINs, as well as OUTER JOINs which filter out non-null 
> attributes from the nullable side.
> This is almost the inverse operation of a traditional Bloom filter; other 
> data structures might be more efficient.
> This would also complement Impala's left-deep pipelined query planning very 
> well for what would otherwise require complex query plans due to reordering 
> restrictions with ANTI/OUTER joins.
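
As an illustration of why exclusion is not simply a Bloom filter used in 
reverse, the sketch below compares an exact anti join with one that drops 
probe rows whenever a Bloom filter reports "possibly present". A Bloom filter 
has false positives but no false negatives, so used this way it can wrongly 
discard rows that should survive. The BloomFilter class and helper functions 
are hypothetical, written only for this example, and the sketch ignores the 
NULL semantics of NOT IN.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def anti_join_exact(probe_keys, build_keys):
    """Reference result: probe keys with no match on the build side."""
    build = set(build_keys)
    return [k for k in probe_keys if k not in build]


def anti_join_with_bloom(probe_keys, build_keys, num_bits):
    """Unsafe variant: drops a probe key whenever the filter says 'maybe'."""
    bloom = BloomFilter(num_bits)
    for k in build_keys:
        bloom.add(k)
    return [k for k in probe_keys if not bloom.might_contain(k)]


probe = [1, 2, 3, 4]
build = [2, 4]

exact = anti_join_exact(probe, build)
# A deliberately undersized one-bit filter reports every key as possibly
# present, so the Bloom-based variant drops rows 1 and 3 as well.
lossy = anti_join_with_bloom(probe, build, num_bits=1)
```

Whatever the filter size, the Bloom-based result can only be a subset of the 
exact one. For exclusion the scan needs a structure that answers "definitely 
present", such as an exact set or bitmap.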


