[ https://issues.apache.org/jira/browse/IMPALA-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rodoni updated IMPALA-2663:
--------------------------------
    Docs Text:   (was: I'd expect this to improve the performance of many common queries that reference and filter nested collections.)

> Filter out tuples with empty collection slots in scan.
> ------------------------------------------------------
>
>                 Key: IMPALA-2663
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2663
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.3.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Major
>              Labels: nested_types, performance
>             Fix For: Impala 2.5.0
>
>
> For queries that reference nested collections, we assign predicates that
> reference fields of the nested collection directly in the corresponding
> parent scan. Those predicates affect how many items are materialized inside a
> collection-typed slot. There's a good chance that such predicates result in
> empty collections, but we currently do not filter out the containing tuple
> when its collections are empty. Consider this query and its plan:
> {code}
> Query:
> select c_custkey, o_orderkey
> from tpch_nested_parquet.customer c, c.c_orders
> where o_orderkey = 1884930
> Plan:
> +------------------------------------------------------------------------------------+
> | Explain String                                                                     |
> +------------------------------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=176.00MB VCores=1                          |
> | WARNING: The following tables are missing relevant table and/or column statistics. |
> | tpch_nested_parquet.customer                                                       |
> |                                                                                    |
> | 05:EXCHANGE [UNPARTITIONED]                                                        |
> | |                                                                                  |
> | 01:SUBPLAN                                                                         |
> | |                                                                                  |
> | |--04:NESTED LOOP JOIN [CROSS JOIN]                                                |
> | |  |                                                                               |
> | |  |--02:SINGULAR ROW SRC                                                          |
> | |  |                                                                               |
> | |  03:UNNEST [c.c_orders]                                                          |
> | |                                                                                  |
> | 00:SCAN HDFS [tpch_nested_parquet.customer c]                                      |
> |    partitions=1/1 files=4 size=554.13MB                                            |
> |    predicates on c_orders: o_orderkey = 1884930                                    |
> +------------------------------------------------------------------------------------+
> {code}
> In the query above, the scan returns one row for every customer. It is clear,
> though, that only a few customer rows will produce any output from the
> subplan. Avoiding subplan iterations is important because they are expensive
> even when unnesting empty collections (due to Reset() of the plan tree).
> For nested TPCH Q12 on my dev setup this improvement reduced the number of
> rows returned from the scan by >100x, from 1.5M to 13K.
> It seems that the following nested TPCH queries could also benefit from this
> improvement:
> Q3, Q4, Q5, Q7, Q8, Q10, Q12, Q13, Q21
> When applying this optimization, special care must be taken with outer- or
> semi-joined nested collections, because the optimization may not be correct
> for them.
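
To make the idea in the description concrete, here is a minimal Python sketch of it. This is not Impala code: the names scan, subplan, drop_empty, Order and Customer are hypothetical and chosen only to mirror the query above, and the three sample customers are made up. It models only the plain inner-join case; the outer- and semi-join cases that the description warns about are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Order:
    o_orderkey: int


@dataclass
class Customer:
    c_custkey: int
    c_orders: List[Order] = field(default_factory=list)


def scan(customers: Iterable[Customer],
         order_pred: Callable[[Order], bool],
         drop_empty: bool) -> Iterator[Customer]:
    """Toy scan: apply the collection predicate while materializing c_orders.

    With drop_empty=True, a customer whose filtered collection is empty is
    not returned at all (the behavior proposed in the issue).
    """
    for c in customers:
        kept = [o for o in c.c_orders if order_pred(o)]
        if drop_empty and not kept:
            continue  # nothing to unnest, so skip the whole tuple
        yield Customer(c.c_custkey, kept)


def subplan(rows: Iterable[Customer]) -> Tuple[List[Tuple[int, int]], int]:
    """Toy subplan: unnest c_orders per scanned row and count the iterations."""
    out: List[Tuple[int, int]] = []
    iterations = 0
    for c in rows:
        iterations += 1  # one subplan iteration per row, even if c_orders is empty
        out.extend((c.c_custkey, o.o_orderkey) for o in c.c_orders)
    return out, iterations


# Made-up sample data: only customer 1 has the order we are looking for.
customers = [
    Customer(1, [Order(10), Order(1884930)]),
    Customer(2, [Order(11)]),
    Customer(3, []),
]


def wanted(o: Order) -> bool:
    return o.o_orderkey == 1884930


rows_before, iters_before = subplan(scan(customers, wanted, drop_empty=False))
rows_after, iters_after = subplan(scan(customers, wanted, drop_empty=True))
print(rows_before == rows_after, iters_before, iters_after)
```

On this sample data both variants produce the same result rows, but the filtered scan drives one subplan iteration instead of three. That is the same effect, at toy scale, as the drop from 1.5M to 13K scanned rows reported for Q12.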