[
https://issues.apache.org/jira/browse/HIVE-24033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176324#comment-17176324
]
Sebastian Klemke commented on HIVE-24033:
-----------------------------------------
Execution plan of the failing query is here: [^failing_query_plan.txt]
joinreducededuplication optimizer logs for this query:
{code:java}
2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main]
correlation.ReduceSinkJoinDeDuplication: Set RS[21] to forward data
2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main]
correlation.ReduceSinkJoinDeDuplication: Set RS[20] to FIXED parallelism: 120
2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main]
correlation.ReduceSinkJoinDeDuplication: Set RS[21] to FIXED parallelism: 120
2020-07-31T14:42:41,542 DEBUG [89354899-5041-441a-ab6f-41e4eb1d3625 main]
correlation.ReduceSinkJoinDeDuplication: Set RS[17] to FIXED parallelism: 120
{code}
> full outer join returns wrong number of results if
> hive.optimize.joinreducededuplication is enabled
> ---------------------------------------------------------------------------------------------------
>
> Key: HIVE-24033
> URL: https://issues.apache.org/jira/browse/HIVE-24033
> Project: Hive
> Issue Type: Bug
> Reporter: Sebastian Klemke
> Priority: Major
> Attachments: failing_query_plan.txt
>
>
> We encountered a hive query that returns incorrect results, when joining two
> CTEs on a group by value. The input tables `id_table` and
> `reference_table` are unfortunately too large to share and on smaller tables
> we have not been able to reproduce.
> {code}
> WITH ids AS (
> SELECT
> record.id AS id
> FROM
> `id_table`
> LATERAL VIEW explode(records) r AS record
> WHERE
> record.id = '5ef0bad74d325f72f0360c19'
> LIMIT 1
> ),
> refs AS (
> SELECT
> reference['id'] AS referenceId
> FROM
> `reference_table`
> WHERE
> partition_date = '2020-06-24'
> AND type = '1b0e9eb5c492d1859815410253dd79b5'
> AND reference['id'] = '5ef0bad74d325f72f0360c19'
> GROUP BY
> reference['id']
> )
> SELECT
> l.id AS id
> , r.referenceId AS referenceId
> FROM
> ids l
> FULL OUTER JOIN
> refs r
> ON
> l.id = r.referenceId
> {code}
> This returns 2 rows, because the join clause misses:
> {code}
> OK
> 5ef0bad74d325f72f0360c19 NULL
> NULL 5ef0bad74d325f72f0360c19
> {code}
> Instead, a single row should be returned. The correct behavior can be
> achieved by either
> * calling lower() on the refs group by statement (doesn't change the string
> contents)
> * setting hive.optimize.joinreducededuplication=false
--
This message was sent by Atlassian Jira
(v8.3.4#803005)