[
https://issues.apache.org/jira/browse/HIVE-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494516#comment-16494516
]
Sergey Shelukhin edited comment on HIVE-19690 at 5/30/18 12:14 AM:
-------------------------------------------------------------------
Fixed most cases to be supported again. It seems that some existing cases that
appear valid were disabled (e.g. in multi_insert_gby3)...
I'm not exactly sure what distinguishes the existing cases from the new one
that produces incorrect results without the change. In the new 3-column test
case, the result actually contains "random" existing column values in incorrect
places (instead of failing), presumably because non-distinct GBY is not ready
to process distinct-GBY-style rows (with structure changed by ReduceSink
distinct handling logic), so it processes them incorrectly but doesn't fail. I
wonder if old results were just correct by coincidence, because there are few
columns and some aggregates are a no-op (one row per key), so they picked the
correct values "accidentally", where the more complicated new case doesn't.
Seems like the old cases that are affected are very specific and rare, and so
it should be ok to disable the optimization for them... unless
[~jcamachorodriguez] [~ashutoshc] have some input :)
Also, I didn't run all the tests on all the drivers; I will update the out files
after another HiveQA run.
RB: https://reviews.apache.org/r/67368/
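
For reference, here is a minimal sketch of the query shape under discussion: a multi-insert where one branch aggregates with DISTINCT and another does not. The table name, column names and rows are hypothetical and are not taken from the actual test case. The sketch only models what each branch is expected to return when evaluated independently; it does not model Hive's ReduceSink or GBY internals.

```python
from collections import defaultdict

# Hypothetical query shape (names are illustrative only):
#
#   FROM src
#   INSERT OVERWRITE TABLE dest1
#     SELECT key, count(DISTINCT c1) GROUP BY key
#   INSERT OVERWRITE TABLE dest2
#     SELECT key, sum(c2), count(*) GROUP BY key;

# Hypothetical source rows: (key, c1, c2)
ROWS = [
    ("a", 1, 10),
    ("a", 1, 20),
    ("a", 2, 30),
    ("b", 3, 40),
    ("b", 3, 40),
]


def distinct_branch(rows):
    """Expected result of the branch that uses count(DISTINCT c1)."""
    seen = defaultdict(set)
    for key, c1, _c2 in rows:
        seen[key].add(c1)
    return {key: len(values) for key, values in seen.items()}


def plain_branch(rows):
    """Expected result of the branch with only non-distinct aggregates."""
    acc = defaultdict(lambda: [0, 0])  # key -> [sum(c2), count(*)]
    for key, _c1, c2 in rows:
        acc[key][0] += c2
        acc[key][1] += 1
    return {key: tuple(values) for key, values in acc.items()}


if __name__ == "__main__":
    print(distinct_branch(ROWS))
    print(plain_branch(ROWS))
```

Each branch must see every source row exactly once and in its original layout. A non-distinct branch that is fed rows reshaped for distinct handling could still return plausible-looking values, which would match the "random" column values described above.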
> multi-insert query with multiple GBY, and distinct in only some branches can
> produce incorrect results
> ------------------------------------------------------------------------------------------------------
>
> Key: HIVE-19690
> URL: https://issues.apache.org/jira/browse/HIVE-19690
> Project: Hive
> Issue Type: Bug
> Reporter: Riju Trivedi
> Assignee: Sergey Shelukhin
> Priority: Major
> Attachments: HIVE-19690.01.patch, HIVE-19690.02.patch,
> HIVE-19690.patch
>
>