peter-toth opened a new pull request, #42997:
URL: https://github.com/apache/spark/pull/42997

   ### What changes were proposed in this pull request?
   This PR fixes a bug regarding non-deterministic seeded Dataset functions.
   
   If we run the following example, the result is 2 equal columns, as expected:
   ```
   val c = rand()
   df.select(c, c)
   
   +--------------------------+--------------------------+
   |rand(-4522010140232537566)|rand(-4522010140232537566)|
   +--------------------------+--------------------------+
   |        0.4520819282997137|        0.4520819282997137|
   +--------------------------+--------------------------+
   ```
   
   But if we use other similar APIs, the result is incorrect:
   ```
   val r1 = random()
   val r2 = uuid()
   val r3 = shuffle(col("x"))
   val x = df.select(r1, r1, r2, r2, r3, r3)
   
   +------------------+------------------+--------------------+--------------------+----------+----------+
   |            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
   +------------------+------------------+--------------------+--------------------+----------+----------+
   |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
   +------------------+------------------+--------------------+--------------------+----------+----------+
   ```
   
   This is because the current implementation of `rand()` passes a random seed
   to `Rand`, but other functions like `random()`, `uuid()` and `shuffle()` don’t.
   Later the `ResolveRandomSeed` rule adds the necessary seeds, but since the
   resolution rules don’t track expression object identities, they can’t map an
   expression object that appears twice to the same transformed object. E.g. in
   the case of `random()`, the `UnresolvedFunction("random", Seq.empty, ...)`
   object is transformed to 2 different `Rand(UnresolvedSeed)` objects, and then
   2 different random seeds are chosen.
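
   The effect can be reproduced without Spark. The following is only a toy
   model, not Spark's analyzer: `Expr`, `SeedDemo`, `freshSeed`, `lazySeeded`
   and `eagerSeeded` are hypothetical names, `UnresolvedFunction` and `Rand`
   merely mimic the shape of the real classes, and a counter stands in for
   the random seed generator so the outcome is deterministic. It sketches why
   a rule that rewrites each occurrence independently hands out two seeds,
   while a seed fixed at construction time survives unchanged.

```scala
// Toy expression tree. These classes only imitate Spark's; they are assumptions
// made for illustration, not the real Catalyst types.
sealed trait Expr
case class UnresolvedFunction(name: String) extends Expr
case class Rand(seed: Long) extends Expr

object SeedDemo {
  // Deterministic stand-in for "pick a random seed".
  private var counter = 0L
  private def freshSeed(): Long = { counter += 1; counter }

  // A rewrite rule applied per occurrence. It has no memory of which input
  // object it has already seen, so the same input can yield different outputs.
  // (It folds function resolution and seed resolution into a single step.)
  def resolve(e: Expr): Expr = e match {
    case UnresolvedFunction("random") => Rand(freshSeed())
    case other                        => other
  }

  // Seed assigned late, by the rule: one object, two occurrences, two seeds.
  def lazySeeded(): Seq[Expr] = {
    val r = UnresolvedFunction("random")
    Seq(r, r).map(resolve)
  }

  // Seed assigned up front, when the expression is built: both occurrences agree.
  def eagerSeeded(): Seq[Expr] = {
    val c = Rand(freshSeed())
    Seq(c, c).map(resolve)
  }

  def main(args: Array[String]): Unit = {
    val l = lazySeeded()
    val e = eagerSeeded()
    println(s"lazy columns equal:  ${l(0) == l(1)}")
    println(s"eager columns equal: ${e(0) == e(1)}")
  }
}
```

   In this model the lazily seeded pair ends up with different seeds and the
   eagerly seeded pair with the same one, which mirrors the difference
   between `random()` and `rand()` described above.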
   
   This PR explicitly adds the seeds.
   
   ### Why are the changes needed?
   To fix the above bug.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, fixes the above bug.
   
   ### How was this patch tested?
   Added new UT.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.

