[
https://issues.apache.org/jira/browse/HIVE-10996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jesus Camacho Rodriguez updated HIVE-10996:
-------------------------------------------
Attachment: HIVE-10996.02.patch
> Aggregation / Projection over Multi-Join Inner Query producing incorrect
> results
> --------------------------------------------------------------------------------
>
> Key: HIVE-10996
> URL: https://issues.apache.org/jira/browse/HIVE-10996
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 1.0.0, 1.2.0, 1.1.0, 1.3.0, 2.0.0
> Reporter: Gautam Kowshik
> Assignee: Jesus Camacho Rodriguez
> Priority: Critical
> Attachments: HIVE-10996.01.patch, HIVE-10996.02.patch,
> HIVE-10996.patch, explain_q1.txt, explain_q2.txt
>
>
> We see the following problem on 1.1.0 and 1.2.0 but not 0.13 which seems like
> a regression.
> The following query (Q1) produces no results:
> {code}
> select s
> from (
> select last.*, action.st2, action.n
> from (
> select purchase.s, purchase.timestamp, max (mevt.timestamp) as
> last_stage_timestamp
> from (select * from purchase_history) purchase
> join (select * from cart_history) mevt
> on purchase.s = mevt.s
> where purchase.timestamp > mevt.timestamp
> group by purchase.s, purchase.timestamp
> ) last
> join (select * from events) action
> on last.s = action.s and last.last_stage_timestamp = action.timestamp
> ) list;
> {code}
> While this one (Q2) does produce results :
> {code}
> select *
> from (
> select last.*, action.st2, action.n
> from (
> select purchase.s, purchase.timestamp, max (mevt.timestamp) as
> last_stage_timestamp
> from (select * from purchase_history) purchase
> join (select * from cart_history) mevt
> on purchase.s = mevt.s
> where purchase.timestamp > mevt.timestamp
> group by purchase.s, purchase.timestamp
> ) last
> join (select * from events) action
> on last.s = action.s and last.last_stage_timestamp = action.timestamp
> ) list;
> 1 21 20 Bob 1234
> 1 31 30 Bob 1234
> 3 51 50 Jeff 1234
> {code}
> The setup to test this is:
> {code}
> create table purchase_history (s string, product string, price double,
> timestamp int);
> insert into purchase_history values ('1', 'Belt', 20.00, 21);
> insert into purchase_history values ('1', 'Socks', 3.50, 31);
> insert into purchase_history values ('3', 'Belt', 20.00, 51);
> insert into purchase_history values ('4', 'Shirt', 15.50, 59);
> create table cart_history (s string, cart_id int, timestamp int);
> insert into cart_history values ('1', 1, 10);
> insert into cart_history values ('1', 2, 20);
> insert into cart_history values ('1', 3, 30);
> insert into cart_history values ('1', 4, 40);
> insert into cart_history values ('3', 5, 50);
> insert into cart_history values ('4', 6, 60);
> create table events (s string, st2 string, n int, timestamp int);
> insert into events values ('1', 'Bob', 1234, 20);
> insert into events values ('1', 'Bob', 1234, 30);
> insert into events values ('1', 'Bob', 1234, 25);
> insert into events values ('2', 'Sam', 1234, 30);
> insert into events values ('3', 'Jeff', 1234, 50);
> insert into events values ('4', 'Ted', 1234, 60);
> {code}
> I realize select * and select s are not all that interesting in this context
> but what lead us to this issue was select count(distinct s) was not returning
> results. The above queries are the simplified queries that produce the issue.
> I will note that if I convert the inner join to a table and select from that
> the issue does not appear.
> Update: Found that turning off hive.optimize.remove.identity.project fixes
> this issue. This optimization was introduced in
> https://issues.apache.org/jira/browse/HIVE-8435
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)