[ 
https://issues.apache.org/jira/browse/IMPALA-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Rodoni updated IMPALA-2663:
--------------------------------
    Docs Text:   (was: I'd expect this to improve the performance of many 
common queries that reference and filter nested collections.)

> Filter out tuples with empty collection slots in scan.
> ------------------------------------------------------
>
>                 Key: IMPALA-2663
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2663
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.3.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Major
>              Labels: nested_types, performance
>             Fix For: Impala 2.5.0
>
>
> For queries that reference nested collections, we assign predicates that 
> reference fields of the nested collection directly in the corresponding 
> parent scan. Those predicates affect how many items are materialized inside a 
> collection-typed slot. There's a good chance that such predicates result in 
> empty collections, but we currently do not filter out the containing tuple if 
> there are such empty collections. Consider this query and its plan:
> {code}
> Query:
> select c_custkey, o_orderkey
> from tpch_nested_parquet.customer c, c.c_orders
> where o_orderkey = 1884930
> Plan:
> +------------------------------------------------------------------------------------+
> | Explain String                                                              
>        |
> +------------------------------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=176.00MB VCores=1                   
>        |
> | WARNING: The following tables are missing relevant table and/or column 
> statistics. |
> | tpch_nested_parquet.customer                                                
>        |
> |                                                                             
>        |
> | 05:EXCHANGE [UNPARTITIONED]                                                 
>        |
> | |                                                                           
>        |
> | 01:SUBPLAN                                                                  
>        |
> | |                                                                           
>        |
> | |--04:NESTED LOOP JOIN [CROSS JOIN]                                         
>        |
> | |  |                                                                        
>        |
> | |  |--02:SINGULAR ROW SRC                                                   
>        |
> | |  |                                                                        
>        |
> | |  03:UNNEST [c.c_orders]                                                   
>        |
> | |                                                                           
>        |
> | 00:SCAN HDFS [tpch_nested_parquet.customer c]                               
>        |
> |    partitions=1/1 files=4 size=554.13MB                                     
>        |
> |    predicates on c_orders: o_orderkey = 1884930                             
>        |
> +------------------------------------------------------------------------------------+
> {code}
> In the query above, the scan returns one row for every customer. It is clear 
> though that only few customer rows will result in any output from the 
> subplan. Avoiding subplan iterations is important because they are expensive 
> even when unnesting empty collections (due to Reset() of the plan tree).
> For nested TPCH Q12 on my dev setup this improvement reduced the number of 
> rows returned from the scan by >100x, from 1.5M to 13K.
> It seems that the following nested TPCH queries could also benefit from this 
> improvement:
> Q3, Q4, Q5, Q7, Q8, Q10, Q12, Q13, Q21
> When considering this optimization, special consideration must be given to 
> outer or semi joined nested collections because then applying this 
> optimization may not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to