[
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493065#comment-16493065
]
Rui Li commented on HIVE-19671:
-------------------------------
Verified that the issue only happens when there are task retries. I can think
of two possible solutions:
# Use rand(seed) instead of rand(). rand(seed) is supposed to generate a
deterministic sequence, so a retried task will assign each row to the same
partition as the original attempt did. This has two prerequisites: the sequence
of method calls is the same across attempts, and the task input arrives in a
deterministic order.
# Disable task retry if the shuffle partition key is non-deterministic, using
configs like {{mapreduce.map.maxattempts}},
{{tez.am.task.max.failed.attempts}}, {{spark.task.maxFailures}}.
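A rough sketch of both options, using the queries from the description. The
seed value 1 is arbitrary, and setting each limit to 1 (a single attempt, so no
retry) is my assumption about how retries would be switched off. Only the
setting for the execution engine in use would be needed.
{code}
-- Option 1: seeded rand(), so a retried task repeats the same sequence
select count(*) from (select * from tbl distribute by rand(1)) a;

-- Option 2: keep rand() but allow only one attempt per task
-- MapReduce
set mapreduce.map.maxattempts=1;
-- Tez
set tez.am.task.max.failed.attempts=1;
-- Spark
set spark.task.maxFailures=1;
select count(*) from (select * from tbl distribute by rand()) a;
{code}
Option 1 still depends on the prerequisites noted above, while option 2 gives
up fault tolerance for the whole query.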
[~gopalv], [~xuefuz], [~ashutoshc], do you have any suggestions? Thanks.
> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
> Issue Type: Bug
> Reporter: Rui Li
> Assignee: Rui Li
> Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}