[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518206#comment-16518206 ]
Liang-Chi Hsieh commented on SPARK-24607:
-----------------------------------------

Thanks [~mgaido]! Having checked the {{Rand}} expression, it seems it already uses a seed, which sounds like option 1 in the Hive ticket. So as long as the query plan stays the same, it should produce a deterministic sequence of numbers.

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}
> This issue was first reported by someone using Kylin to build a cube with
> Hive SQL that includes distribute by rand(). I think it is also a serious
> hidden problem in Spark SQL.
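
A minimal sketch of the point made in the comment about the seeded {{Rand}} expression, not Spark's actual implementation. Python's random.Random stands in for the generator, and the helper names (rand_column, assign) and the "seed plus partition index" scheme are assumptions made for illustration. It shows that a fixed seed yields the same sequence on every run, while the value a given row receives still depends on the order in which the rows arrive.

```python
import random


def rand_column(seed, partition_index, num_rows):
    """Return the values a seeded generator yields for one partition.

    Hypothetical stand-in for a seeded rand(): the generator is
    re-created from (seed + partition_index) on every call.
    """
    rng = random.Random(seed + partition_index)
    return [rng.random() for _ in range(num_rows)]


def assign(rows, seed, partition_index):
    """Pair each row with the next generated value, in arrival order."""
    values = rand_column(seed, partition_index, len(rows))
    return dict(zip(rows, values))


SEED = 42  # arbitrary illustrative seed

# Same seed, same partition: the generated sequence is identical.
sequence_is_deterministic = rand_column(SEED, 0, 3) == rand_column(SEED, 0, 3)

# Same rows in the same order: every row gets the same value again.
first_run = assign(["a", "b", "c"], SEED, partition_index=0)
same_order = assign(["a", "b", "c"], SEED, partition_index=0)
same_order_matches = first_run == same_order

# Same rows in a different order: the sequence is unchanged, but the
# value attached to each individual row is not.
reordered = assign(["c", "a", "b"], SEED, partition_index=0)
reordered_matches = first_run == reordered

print(sequence_is_deterministic, same_order_matches, reordered_matches)
```

If the row order within a partition can differ between executions (an assumption here, not something stated in the comment), a deterministic sequence alone would not keep the row-to-value assignment stable.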