[jira] [Comment Edited] (HIVE-19690) multi-insert query with multiple GBY, and distinct in only some branches can produce incorrect results

Sergey Shelukhin (JIRA) Tue, 29 May 2018 17:15:09 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494516#comment-16494516
 ]


Sergey Shelukhin edited comment on HIVE-19690 at 5/30/18 12:14 AM:
-------------------------------------------------------------------

Fixed most cases so that they are supported again. It seems that some existing 
cases that look valid were disabled (e.g. in multi_insert_gby3)... 
I'm not exactly sure what distinguishes the existing cases from the new one 
that produces incorrect results without the change. In the new 3-column test 
case, the result actually contains "random" existing column values in incorrect 
places (instead of failing), presumably because the non-distinct GBY is not 
ready to process distinct-GBY-style rows (whose structure is changed by the 
ReduceSink distinct handling logic), so it processes them incorrectly but 
doesn't fail. I wonder if the old results were just correct by coincidence: 
there are few columns and some aggregates are a no-op (one row per key), so 
they picked the correct values "accidentally", whereas the more complicated new 
case doesn't.
It seems the old cases that are affected are very specific and rare, so it 
should be OK to disable the optimization for them... unless 
[~jcamachorodriguez] [~ashutoshc] have some input :)
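
For illustration, a query of the shape this issue is about might look like the 
sketch below: a multi-insert with several GBY branches, where only one branch 
uses DISTINCT. The table and column names (src, dest_distinct, dest_plain, k, 
v1, v2) are hypothetical and are not taken from the actual test case, and this 
sketch is not claimed to reproduce the incorrect results by itself.

```sql
-- Hypothetical tables and columns, for illustration only.
FROM src
-- Branch 1: GBY with a DISTINCT aggregate.
INSERT OVERWRITE TABLE dest_distinct
  SELECT k, count(DISTINCT v1)
  GROUP BY k
-- Branch 2: GBY on the same key with non-distinct aggregates only.
INSERT OVERWRITE TABLE dest_plain
  SELECT k, max(v1), min(v2)
  GROUP BY k;
```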

Also, I didn't run all the tests on all the drivers; I will update the out 
files after another HiveQA run.
RB: https://reviews.apache.org/r/67368/

> multi-insert query with multiple GBY, and distinct in only some branches can 
> produce incorrect results
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-19690
>                 URL: https://issues.apache.org/jira/browse/HIVE-19690
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Riju Trivedi
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HIVE-19690.01.patch, HIVE-19690.02.patch, 
> HIVE-19690.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)