[jira] [Commented] (HIVE-13019) Optimizer COLLECT_LIST/COLLECT_SET

Gopal V (JIRA) Sat, 06 Feb 2016 15:45:06 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136061#comment-15136061
 ]


Gopal V commented on HIVE-13019:
--------------------------------

The inner join prevents a bit of the push-down, because Hive doesn't record 
primary-key/foreign-key relationships between cid <-> oid.

set hive.transpose.aggr.join=true; in hive-2.0 will do that push-down, assuming 
the CBO cost model infers it to be a PK/FK join.

If it is not a PK/FK join, an inner join can explode as an MxN join before the 
group-by is applied, and that completely changes what collect_list() returns 
depending on the join conditions. For example, if there are 2 customer rows for 
cid=1 and 100 order rows for oid=1, the join output has 200 rows. If you push 
the group-by through the join, you can end up with exactly 1 order row for 
oid=1, which gives a different result inside the list.
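
The row counts in that example can be checked with a small simulation. This is 
a plain Python sketch, not Hive; the table contents and the helper names 
(inner_join, collect_list_after_join, collect_list_before_join) are made up 
for illustration.

```python
from collections import defaultdict

def inner_join(customers, orders):
    """Inner join on c.cid = o.oid, as in the query from the issue."""
    return [(c, o) for c in customers for o in orders if c["cid"] == o["oid"]]

def collect_list_after_join(customers, orders):
    """Join first, then group by oid and collect the order dates."""
    groups = defaultdict(list)
    for _c, o in inner_join(customers, orders):
        groups[o["oid"]].append(o["date"])
    return dict(groups)

def collect_list_before_join(customers, orders):
    """Group orders by oid first (the pushed-down aggregate), then join."""
    pre = defaultdict(list)
    for o in orders:
        pre[o["oid"]].append(o["date"])
    # After pre-aggregation there is exactly one order row per oid.
    agg_rows = [{"oid": k, "dates": v} for k, v in pre.items()]
    return [(c, a) for c in customers for a in agg_rows if c["cid"] == a["oid"]]

# Hypothetical data: 2 customer rows for cid=1, 100 order rows for oid=1.
customers = [{"cid": 1, "name": "a"}, {"cid": 1, "name": "b"}]
orders = [{"oid": 1, "date": d} for d in range(100)]

joined = inner_join(customers, orders)
after = collect_list_after_join(customers, orders)
before = collect_list_before_join(customers, orders)

print(len(joined))                  # 200 joined rows (2 x 100)
print(len(after[1]))                # 200 entries in the list for oid=1
print(len(before))                  # 2 output rows, one per customer row
print(len(before[0][1]["dates"]))   # 100 entries in each list
```

With the aggregate applied after the join, the list for oid=1 holds each order 
twice; with the aggregate pushed below the join, each list holds it once. The 
two plans only agree when the join key is unique on the customer side.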

Beyond that, this is a common pattern in Pig, which can actually spill a BAG, 
while Hive cannot spill part of a row when a single row's collect_list() grows 
to multiple megabytes (usually the "null user" in click-streams).

Occasionally, I end up having to rewrite such collect_list() queries which were 
written before Hive officially had the necessary OLAP operators (FIRST_VALUE, 
LAST_VALUE).
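
As a rough illustration of why that rewrite helps, here is a Python comparison 
of the two shapes. It is not Hive code, and it assumes the legacy query used 
collect_list() only to read the first or last element; the click-stream rows 
and function names are hypothetical. The first version holds the whole 
per-key list just to read its ends, while the second keeps two values per key.

```python
def first_last_via_collect(rows):
    """Emulates collect_list() followed by reading both ends of the list."""
    lists = {}
    for key, ts in sorted(rows, key=lambda r: r[1]):
        lists.setdefault(key, []).append(ts)   # whole list held per key
    return {k: (v[0], v[-1]) for k, v in lists.items()}

def first_last_streaming(rows):
    """Emulates FIRST_VALUE / LAST_VALUE over a partition ordered by ts."""
    state = {}
    for key, ts in sorted(rows, key=lambda r: r[1]):
        if key not in state:
            state[key] = (ts, ts)               # first row seen for this key
        else:
            state[key] = (state[key][0], ts)    # keep first, update last
    return state

# Hypothetical click-stream: (user, timestamp); None stands for the "null user".
rows = [("u1", 3), (None, 1), ("u1", 7), (None, 9), (None, 4)]

via_collect = first_last_via_collect(rows)
streaming = first_last_streaming(rows)
print(via_collect == streaming)
```

Both functions return the same first/last pair per key, but only the first one 
grows with the number of rows for a skewed key such as the null user.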

> Optimizer COLLECT_LIST/COLLECT_SET 
> -----------------------------------
>
>                 Key: HIVE-13019
>                 URL: https://issues.apache.org/jira/browse/HIVE-13019
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Logical Optimizer
>            Reporter: Dustin Cote
>            Priority: Minor
>
> Currently when using a COLLECT_SET/COLLECT_LIST that involves data from a 
> single table, the aggregation is done after any JOIN operation that is 
> present in the query.  For example:
> {code}
> insert into table nested_customers_orders
> select c.*, collect_list(named_struct("oid", o.oid, "order_date", o.date...))
> from customers c inner join orders o on (c.cid = o.oid)
> group by o.oid, o.date,...
> {code}
> If we can tell the optimizer to perform the COLLECT_LIST first (where 
> possible) we can see some performance gains in this pattern of query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
