[
https://issues.apache.org/jira/browse/HIVE-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902814#comment-14902814
]
Ashutosh Chauhan commented on HIVE-11132:
-----------------------------------------
This optimization has an assumption that join cannot reorder its output rows,
which may not always be true. In particular, Tez's custom edge and custom
simple edge. Also, reduce-side hash joins.
At the time where this optimization is attempted (i.e., in logical optimization
phase) there is not sufficient info to determine this. I think we should
migrate this optimization from logical to physical layer. Thoughts?
> Queries using join and group by produce incorrect output when
> hive.auto.convert.join=false and hive.optimize.reducededuplication=true
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-11132
> URL: https://issues.apache.org/jira/browse/HIVE-11132
> Project: Hive
> Issue Type: Bug
> Components: Logical Optimizer
> Affects Versions: 0.14.0, 1.0.0, 1.2.0, 1.1.0
> Reporter: Rich Haase
> Assignee: Ashutosh Chauhan
> Attachments: HIVE-11132.patch
>
>
> Queries using join and group by produce multiple output rows with the same
> key when hive.auto.convert.join=false and
> hive.optimize.reducededuplication=true. This interaction between
> configuration parameters is unexpected and should be well documented at the
> very least and should likely be considered a bug.
> e.g.
> hive> set hive.auto.convert.join = false;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
> > FROM foo
> > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
> > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
> > JOIN forecast ON (foo.id = forecast.id AND foo.line_id =
> forecast.line_id)
> > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
> > GROUP BY foo.id;
> XYZ 79
> XYZ 74
> XYZ 297
> XYZ 66
> hive> set hive.auto.convert.join = true;
> hive> set hive.optimize.reducededuplication = true;
> hive> SELECT foo.id, count(*) as factor
> > FROM foo
> > JOIN bar ON (foo.id = bar.id and foo.line_id = bar.line_id)
> > JOIN split ON (foo.id = split.id and foo.line_id = split.line_id)
> > JOIN forecast ON (foo.id = forecast.id AND foo.line_id =
> forecast.line_id)
> > WHERE foo.order != ‘blah’ AND foo.id = ‘XYZ'
> > GROUP BY foo.id;
> XYZ 516
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)