[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187932#comment-17187932 ]
Sean R. Owen commented on SPARK-32728:
--------------------------------------

I think this is to be expected: you will get different values every time x is evaluated; the value is not somehow fixed at the time you declare the column. I would expect it, however, to behave as you intend if you materialize the DataFrame first, for example with cache() + count(). Even then, there are circumstances where it is re-evaluated.

> Using groupby with rand creates different values when joining table with
> itself
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-32728
>                 URL: https://issues.apache.org/jira/browse/SPARK-32728
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5, 3.0.0
>        Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>            Reporter: Joachim Bargsten
>            Priority: Minor
>
> When running the following query in a Python 3 notebook on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be.
> {code:python}
> import pyspark.sql.functions as F
>
> sdf = spark.range(100)
> sdf = (
>     sdf.withColumn("a", F.col("id") + 1)
>     .withColumn("b", F.col("id") + 2)
>     .withColumn("c", F.col("id") + 3)
>     .withColumn("d", F.col("id") + 4)
>     .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col("x") - F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code works as expected.
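The nondeterminism described in the comment can be illustrated without a Spark cluster: an unmaterialized column containing rand() behaves like a zero-argument function that draws a fresh random number on every evaluation, and a self-join evaluates it once per side. A minimal pure-Python sketch of that idea (the names here are illustrative, not Spark API):

```python
import random

# An unmaterialized column built from rand() behaves like a thunk:
# each evaluation draws a fresh random number.
def lazy_x():
    return random.random()

# A self-join evaluates the expression once per side of the join,
# so "x" and its renamed copy "xx" generally disagree.
x, xx = lazy_x(), lazy_x()
delta_x = abs(x - xx)  # almost surely nonzero, the symptom in the report

# Materializing (cache() + count() in Spark) amounts to evaluating
# once and reusing the stored result on both sides of the join.
x_cached = lazy_x()
assert abs(x_cached - x_cached) == 0.0
```

In the reported job the analogous change would be to call sdf = sdf.cache() followed by an action such as sdf.count() before the self-join, with the caveat from the comment that Spark may still re-evaluate the expression in some circumstances (e.g. if cached partitions are evicted).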
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org