[jira] [Comment Edited] (SPARK-24607) Distribute by rand() can lead to data inconsistency

Kent Yao (Jira) Thu, 03 Dec 2020 22:18:45 -0800


    [ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243743#comment-17243743 ]


Kent Yao edited comment on SPARK-24607 at 12/4/20, 6:17 AM:
------------------------------------------------------------

This can happen when the map stage is retried: the same record in a map task 
may be routed to different reduce tasks across task attempts.

This can produce an incomplete result set when a non-deterministic 
expression, e.g. rand(), is used in jobs that need a shuffle, e.g. aggregates 
or sort-merge joins.
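
To make the failure mode concrete, here is a small simulation in plain Python. It is 
not Spark code, only a simplified model of one map task and four reducers. The 
names (map_task, NUM_REDUCERS) and the seeds are made up for illustration, and 
the scenario assumes that two reducers had already fetched their blocks before 
the map output was lost and recomputed.

```python
import random

NUM_REDUCERS = 4  # hypothetical number of reduce tasks


def map_task(records, seed):
    """Partition records by a random draw, as `distribute by rand()` would."""
    rng = random.Random(seed)  # a different seed models a different task attempt
    buckets = {r: [] for r in range(NUM_REDUCERS)}
    for rec in records:
        buckets[rng.randrange(NUM_REDUCERS)].append(rec)
    return buckets


records = list(range(100))

# Attempt 0: reducers 0 and 1 fetch their blocks, then the map output is lost.
attempt0 = map_task(records, seed=0)
# Attempt 1: the retry draws different random numbers, so the blocks differ.
attempt1 = map_task(records, seed=1)

# Reducers 0 and 1 keep what they fetched from attempt 0;
# reducers 2 and 3 fetch from attempt 1.
received = attempt0[0] + attempt0[1] + attempt1[2] + attempt1[3]

missing = sorted(set(records) - set(received))
duplicated = sorted({r for r in received if received.count(r) > 1})
print("input rows:", len(records))
print("rows seen by reducers:", len(received))
print("missing:", len(missing), "duplicated:", len(duplicated))
```

Each attempt on its own covers every record exactly once. The loss and 
duplication only appear when blocks from two different attempts are mixed.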

We may need a random but replayable function to handle these use cases, 
because distributing by rand() is a common way for users to deal with data skew.
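
As a rough sketch of what "random but replayable" could mean, the partition can 
be derived from a hash of the record's own content instead of a fresh random 
draw. The example below is plain Python, not a proposed Spark API. The function 
name replayable_partition and the choice of CRC32 are assumptions made only for 
illustration.

```python
import zlib

NUM_REDUCERS = 4  # hypothetical number of reduce tasks


def replayable_partition(record: str, num_partitions: int = NUM_REDUCERS) -> int:
    """Pick a pseudo-random partition that depends only on the record content."""
    # crc32 is stable across processes and runs, unlike Python's built-in hash().
    return zlib.crc32(record.encode("utf-8")) % num_partitions


records = [f"row-{i}" for i in range(100)]

# Two "task attempts" over the same input choose the same partitions.
first_attempt = [replayable_partition(r) for r in records]
retry_attempt = [replayable_partition(r) for r in records]
print(first_attempt == retry_attempt)
```

One caveat with this approach: identical records always land in the same 
partition, so the hash should cover the whole row or a unique column rather than 
only the skewed key. In Spark SQL, an expression along the lines of 
pmod(hash(id), n) might play a similar role, assuming id is a reasonably unique 
column.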

Otherwise, we may need to forbid non-deterministic functions in shuffle-related 
operations.

cc [~cloud_fan] [~ulysses] [~maropu]



> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>              Labels: bulk-closed
>
> Noticed the following queries can give different results:
> {code:sql}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with 
> Hive SQL that includes distribute by rand(); data inconsistency may occur 
> during fault-tolerance operations. Since Spark has a similar fault-tolerance 
> mechanism, I think this is also a serious hidden problem in Spark SQL.


