[ 
https://issues.apache.org/jira/browse/IMPALA-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shant Hovsepian updated IMPALA-2929:
------------------------------------
    Labels: complextype nested_types performance query_generator supportability 
usability  (was: nested_types performance query_generator supportability 
usability)

> Add a hint to eliminate self-join for BI systems that don't support nested 
> types yet
> ------------------------------------------------------------------------------------
>
>                 Key: IMPALA-2929
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2929
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: impala 2.3
>         Environment: CDH 5.5; Impala 2.3
>            Reporter: Ruslan Dautkhanov
>            Priority: Minor
>              Labels: complextype, nested_types, performance, query_generator, 
> supportability, usability
>
> We tested today a workaround for a BI tool to fetch nested data - created an 
> Impala view for a table with nested collection.
> OBIEE (and many other BI tools) require to define "dimension" tables (like a 
> customer dimensions), and facts (for example customer transactions). With a 
> table that has a nested collection (fact), all non-nested attributes make up 
> a customer dimension. And nested collection is a fact.
> So when we created a view that, e.g.
> {code:sql}
> CREATE VIEW adash_mrt.amf_trans_fact_vw 
> AS
> select fact.individ, trans.*
> from adash_mrt.amf_tsp_fact fact
>    , fact.amft trans;
> {code}
> And presented amf_trans_fact_vw to OBIEE as a fact, and 
> adash_mrt.amf_tsp_fact as a customer dimension. 
> So BI tools like OBIEE then always would try to join that dimension and a 
> fact, for example like 
> {code:sql}
> SELECT count(dim.individ), sum(trans.activity_dollars)
> FROM amf_tsp_fact dim, amf_trans_fact_vw trans
> WHERE dim.individ = trans.individ 
>       and trans.activity_type='M'  
>       and dim.tsp_gender='2'  
>       and dim.final_ace_city='SUPERIOR';
> {code}
> Then Impala would actually self-join that table amf_tsp_fact with a nested 
> collection amft (!), rendering all benefits of nested collection useless. 
> Because main benefit is avoiding expensive joins, but Impala's query 
> optimizer not that smart to avoid that self-join.. 
> This JIRA is a suggestion to introduce a query-level hint or a new "set" 
> session-level parameter that would eliminate such a self-join for BI tools 
> that do not support nested types directly. It might be a hint in that CREATE 
> VIEW query too.
> Here's explain plan for the above query:
> {noformat}
> 10:AGGREGATE [FINALIZE]
> |  output: count:merge(dim.individ), 
> sum:merge(CAST(dim.activity_total_dollars_spent AS BIGINT)), 
> sum:merge(trans.activity_dollars)
> |
> 09:EXCHANGE [UNPARTITIONED]
> |
> 07:AGGREGATE
> |  output: count(dim.individ), sum(trans.item.activity_dollars)
> |
> 06:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: fact.individ = dim.individ
> |
> |--08:EXCHANGE [BROADCAST]
> |  |
> |  00:SCAN HDFS [adash_mrt.amf_tsp_fact dim]
> |     partitions=1/1 files=6500 size=291.37GB
> |     predicates: dim.tsp_gender = '2', dim.final_ace_city = 'SUPERIOR'
> |
> 02:SUBPLAN
> |
> |--05:NESTED LOOP JOIN [CROSS JOIN]
> |  |
> |  |--03:SINGULAR ROW SRC
> |  |
> |  04:UNNEST [fact.amft trans]
> |
> 01:SCAN HDFS [adash_mrt.amf_tsp_fact fact]
>    partitions=1/1 files=6500 size=291.37GB
>    predicates on trans: trans.item.activity_type = 'M'
> {noformat}
> Notice HASH JOIN. 
> We implemented nested data types at the data lake layer, but stuck with BI 
> tools because they don't support nested collections yet. It seems a low 
> hanging fruit to close the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to