https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188070#comment-17188070
Takeshi Yamamuro commented on SPARK-32728:
------------------------------------------

+1 on Sean's comment, so I'll close this.

> Using groupby with rand creates different values when joining table with itself
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-32728
>                 URL: https://issues.apache.org/jira/browse/SPARK-32728
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5, 3.0.0
>        Environment: Tested with Azure Databricks 7.2 and 6.6 (Apache Spark 3.0.0 and 2.4.5, Scala 2.12 and 2.11)
>                     Worker type: Standard_DS3_v2 (2 workers)
>            Reporter: Joachim Bargsten
>            Priority: Minor
>
> When running the following query in a Python 3 notebook on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be:
> {code:java}
> import pyspark.sql.functions as F
>
> sdf = spark.range(100)
> sdf = (
>     sdf.withColumn("a", F.col("id") + 1)
>     .withColumn("b", F.col("id") + 2)
>     .withColumn("c", F.col("id") + 3)
>     .withColumn("d", F.col("id") + 4)
>     .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
>
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col("x") - F.col("xx"))).agg(F.sum("delta_x"))
>
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code works as expected.
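The behavior reported above follows from Spark's lazy evaluation: `F.rand()` is a non-deterministic expression, and each branch of the self-join re-executes the plan, so the two branches draw different random values; materializing the DataFrame (e.g. with `cache()` or `checkpoint()`) before the join is the usual remedy. A minimal sketch of the same effect in plain Python (not Spark code; the seeded `random.Random` stands in for `F.rand()` in an unmaterialized plan):

```python
import random

# A lazy expression is recomputed every time it is evaluated, much like
# F.rand() in an uncached Spark plan: each branch of a self-join re-runs it.
rng = random.Random(42)  # fixed seed only to make the sketch reproducible
lazy_x = lambda: rng.random()

left = lazy_x()   # the "x" column seen by the left branch of the join
right = lazy_x()  # the "xx" column seen by the right branch of the join
assert left != right  # the two evaluations disagree, so delta_x > 0

# Materializing the value once (analogous to sdf.cache() before the join)
# makes both branches read the same number.
materialized = lazy_x()
assert materialized - materialized == 0.0
```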