[ https://issues.apache.org/jira/browse/SPARK-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600400#comment-14600400 ]
Michael Armbrust commented on SPARK-8599:
-----------------------------------------
What about this case?
Take a DataFrame with a seeded random column:
{code}
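// Imports needed if the session has not already brought them into scope.
import sqlContext.implicits._
import org.apache.spark.sql.functions.rand
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))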
df.show()
+--+-------------------+
|id|                  r|
+--+-------------------+
| 1|0.47027138530546275|
| 2|0.11616379100300933|
| 3|0.45008521832568693|
| 4| 0.9959647025839259|
| 5| 0.6038577325006693|
| 6| 0.6319470735268434|
| 7|0.22327628846133507|
| 8|0.24223739932588373|
| 9| 0.8395518879513995|
+--+-------------------+
{code}
A plain self-join works as expected; both sides show the same r for each id:
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.as("b"), $"a.id" === $"b.id").show()
+--+-------------------+--+-------------------+
|id|                  r|id|                  r|
+--+-------------------+--+-------------------+
| 1|0.47027138530546275| 1|0.47027138530546275|
| 2|0.11616379100300933| 2|0.11616379100300933|
| 3|0.45008521832568693| 3|0.45008521832568693|
| 4| 0.9959647025839259| 4| 0.9959647025839259|
| 5| 0.6038577325006693| 5| 0.6038577325006693|
| 6| 0.6319470735268434| 6| 0.6319470735268434|
| 7|0.22327628846133507| 7|0.22327628846133507|
| 8|0.24223739932588373| 8|0.24223739932588373|
| 9| 0.8395518879513995| 9| 0.8395518879513995|
+--+-------------------+--+-------------------+
{code}
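But this is kind of confusing: the right side is filtered on r < 0.5, yet b.r no longer matches a.r for the same id, and several b.r values are above 0.5.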
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").show()
+--+-------------------+--+-------------------+
|id|                  r|id|                  r|
+--+-------------------+--+-------------------+
| 1|0.47027138530546275| 1|0.11616379100300933|
| 2|0.11616379100300933| 2| 0.8588851155739579|
| 3|0.45008521832568693| 3| 0.9959647025839259|
| 4| 0.9959647025839259| 4| 0.5910417491366206|
| 7|0.22327628846133507| 7|0.24223739932588373|
| 9| 0.8395518879513995| 9| 0.8994457593465164|
+--+-------------------+--+-------------------+
{code}
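One possible way to read the output above, offered as a toy model and not as a description of what Spark does internally: if the filter and the projection each evaluate the random expression for themselves, the value the filter tests is not the value that ends up in the row. The sketch below uses plain Scala with `scala.util.Random`; the names `SeededRand`, `materializeThenFilter` and `filterThenProject` are made up for illustration.

```scala
import scala.util.Random

object RandReevaluation {
  // Toy stand-in for a seeded random expression: every call to next()
  // advances hidden state, so asking twice yields two different values.
  final class SeededRand(seed: Long) {
    private val rng = new Random(seed)
    def next(): Double = rng.nextDouble()
  }

  // Strategy 1: evaluate r once per row, store it, then filter the stored values.
  def materializeThenFilter(ids: Seq[Long], seed: Long): Seq[(Long, Double)] = {
    val rand = new SeededRand(seed)
    val rows = ids.map(id => (id, rand.next()))
    rows.filter { case (_, r) => r < 0.5 }
  }

  // Strategy 2: the filter draws a value to test, and the projection draws
  // again for the value it emits.
  // Returns (id, value the filter tested, value the projection emitted).
  def filterThenProject(ids: Seq[Long], seed: Long): Seq[(Long, Double, Double)] = {
    val rand = new SeededRand(seed)
    ids.flatMap { id =>
      val tested = rand.next()
      if (tested < 0.5) Some((id, tested, rand.next())) else None
    }
  }

  def main(args: Array[String]): Unit = {
    val ids = 1L to 9L
    println(materializeThenFilter(ids, 0L))
    println(filterThenProject(ids, 0L))
  }
}
```

In the first strategy every surviving row satisfies r < 0.5. In the second, the emitted value is unrelated to the one that passed the filter, so it can land above 0.5, which is the same shape of surprise as in the filtered join.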
> Use a Random operator to handle Random distribution generating expressions
> --------------------------------------------------------------------------
>
> Key: SPARK-8599
> URL: https://issues.apache.org/jira/browse/SPARK-8599
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.4.0
> Reporter: Yin Huai
> Priority: Critical
>
> Right now, random-distribution generators are implemented as ordinary
> expressions, so we have to track them in lots of places in the optimizer and
> handle them carefully. Otherwise, they will be treated as stateless
> expressions and behave unexpectedly (e.g. SPARK-8023).
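
To make the "track them in the optimizer" point concrete, here is a minimal sketch of the kind of guard a rewrite rule would need. It is a toy plan tree in plain Scala, not Catalyst: every class name, the `deterministic` flag and the rule `pushFilterThroughProject` are assumptions made up for this illustration.

```scala
object PushdownGuard {
  // Toy expression tree; each node reports whether re-evaluating it is safe.
  sealed trait Expr { def deterministic: Boolean }
  case class Col(name: String) extends Expr { val deterministic = true }
  case class Rand(seed: Long) extends Expr { val deterministic = false }
  case class Alias(child: Expr, name: String) extends Expr {
    val deterministic: Boolean = child.deterministic
  }
  case class LessThan(left: Expr, right: Double) extends Expr {
    val deterministic: Boolean = left.deterministic
  }

  // Toy logical plan.
  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Project(exprs: Seq[Expr], child: Plan) extends Plan
  case class Filter(cond: Expr, child: Plan) extends Plan

  // Push a filter below a projection only when every projected expression is
  // deterministic. A real rule would also rewrite aliases inside the
  // condition; this sketch skips that step.
  def pushFilterThroughProject(plan: Plan): Plan = plan match {
    case Filter(cond, Project(exprs, child)) if exprs.forall(_.deterministic) =>
      Project(exprs, Filter(cond, child))
    case other => other
  }
}
```

Every rule that moves, duplicates or collapses expressions needs a check of this kind, which is the bookkeeping cost the description is pointing at.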