[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

Rui Li (JIRA) Wed, 23 May 2018 05:15:49 -0700

    [ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487148#comment-16487148 ]


Rui Li commented on HIVE-19671:
-------------------------------

I haven't verified it, but my guess is that the issue happens with task 
failover. Suppose the mappers of {{distribute by}} finish successfully. The 
reducers then start but fail to fetch shuffle data because some nodes hosting 
mapper output are lost, so those mappers are retried. Since the partition keys 
are randomly generated, a retried task can assign rows to different partitions 
than the previous attempt did. Reducers that already fetched output from the 
first attempt would then see some rows twice or miss them, which leads to the 
inconsistency.
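
To make the guess above concrete, here is a minimal Python sketch of the 
suspected failover sequence. It is a toy simulation, not Hive or shuffle code: 
the names {{run_mapper}} and {{simulate}}, the two-reducer setup, and the seeds 
are all assumptions made for illustration. One reducer reads its partition from 
the first mapper attempt, the other reads from a retried attempt that drew 
different random keys.

```python
import random


def run_mapper(rows, num_reducers, rng):
    """Hypothetical mapper: pick a random partition per row, like rand()."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return partitions


def simulate(num_rows=1000, seed_first=1, seed_retry=2):
    """Return (rows counted, duplicated rows, missing rows) for two reducers."""
    rows = list(range(num_rows))

    # First mapper attempt finishes; reducer 0 fetches its partition from it.
    first = run_mapper(rows, 2, random.Random(seed_first))
    reducer0 = first[0]

    # The node holding the output is lost before reducer 1 fetches, so the
    # mapper is retried and draws fresh random partition keys.
    retry = run_mapper(rows, 2, random.Random(seed_retry))
    reducer1 = retry[1]

    seen0, seen1 = set(reducer0), set(reducer1)
    duplicated = len(seen0 & seen1)
    missing = num_rows - len(seen0 | seen1)
    return len(reducer0) + len(reducer1), duplicated, missing


if __name__ == "__main__":
    # Same seed models a deterministic partition key: the retry is harmless.
    print("deterministic retry:", simulate(seed_first=1, seed_retry=1))
    # Different seeds model rand(): the retry reshuffles the rows.
    print("random retry:       ", simulate(seed_first=1, seed_retry=2))
```

In this sketch the counted total equals the true row count plus the duplicated 
rows minus the missing rows, so a retry that reuses the same partitioning keeps 
the count exact, while a retry with fresh random keys generally does not.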

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand());
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
