[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

Rui Li (JIRA) Mon, 28 May 2018 20:46:28 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493065#comment-16493065
 ]


Rui Li commented on HIVE-19671:
-------------------------------

I verified that the issue only happens when there are task retries. I can think 
of two possible solutions:
 # Use rand(seed) instead of rand(). rand(seed) is supposed to generate a 
deterministic sequence, so the retried task will assign each row to the same 
partition as the original attempt. The prerequisites are that the sequence of 
method calls is the same and that the task input arrives in a deterministic 
order.
 # Disable task retry if the shuffle partition key is non-deterministic, using 
configs like {{mapreduce.map.maxattempts}}, 
{{tez.am.task.max.failed.attempts}}, {{spark.task.maxFailures}}.
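
To illustrate why retries matter and why a seed helps, here is a toy Python simulation. It is only a sketch and not Hive code: the names {{NUM_REDUCERS}}, {{run_map_attempt}} and {{rows_seen_by_reducers}} are made up for illustration, and it assumes the failure mode is that one reducer fetches its partition from the first map attempt while another reducer fetches from a retried attempt.

```python
import random

NUM_REDUCERS = 2  # hypothetical number of shuffle partitions


def run_map_attempt(rows, rng):
    """Assign each row to a reducer by drawing one random number per row."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        partitions[int(rng.random() * NUM_REDUCERS)].append(row)
    return partitions


def rows_seen_by_reducers(rows, make_rng):
    """Reducer 0 fetches from the first attempt, reducer 1 from a retry."""
    first = run_map_attempt(rows, make_rng())
    retry = run_map_attempt(rows, make_rng())
    return sorted(first[0] + retry[1])


rows = list(range(1000))

# Seeded: both attempts draw the same sequence over the same input order,
# so the two reducers together see every row exactly once.
seeded = rows_seen_by_reducers(rows, lambda: random.Random(42))

# Unseeded: each attempt draws a fresh sequence, so a row can land in both
# fetched partitions (duplicated) or in neither (lost).
unseeded = rows_seen_by_reducers(rows, lambda: random.Random())

print("seeded matches input:  ", seeded == rows)
print("unseeded matches input:", unseeded == rows)
```

The seeded case only works because both attempts read the rows in the same order and make the same number of calls, which is the prerequisite mentioned in the first option.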

[~gopalv], [~xuefuz], [~ashutoshc], do you have any suggestions? Thanks.

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
