srowen commented on issue #25404: [SPARK-28683][BUILD] Upgrade Scala to 2.12.10
URL: https://github.com/apache/spark/pull/25404#issuecomment-532278778
 
 
   Well... that's weird. Lots of tests fail because they get slightly different 
answers, and almost all look like tests that depend on a seeded random number 
generator at some level.
   
   Take 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110781/testReport/test.org.apache.spark/JavaAPISuite/sample/
 for example.
   
   ```java
     @Test
     public void sample() {
       List<Integer> ints = IntStream.iterate(1, x -> x + 1)
         .limit(20)
         .boxed()
         .collect(Collectors.toList());
       JavaRDD<Integer> rdd = sc.parallelize(ints);
       // the seeds here are "magic" to make this work out nicely
       JavaRDD<Integer> sample20 = rdd.sample(true, 0.2, 8);
       assertEquals(2, sample20.count());
       JavaRDD<Integer> sample20WithoutReplacement = rdd.sample(false, 0.2, 2);
       assertEquals(4, sample20WithoutReplacement.count());
     }
   ```
   
   It's seeded, so it ought to produce the same answer. This has worked the same 
way for a long time, giving that particular answer with that particular seed. 
It's pretty straightforwardly using a seeded java.util.Random. What could have 
changed?
   
   At the moment I'm wondering if somehow a compiler or collections change 
causes the input to be partitioned differently or iterated over differently. 
Still thinking about it and reviewing the (pretty minor) Scala changes.
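
To make the partitioning hypothesis concrete, here is a minimal standalone sketch, not Spark's actual sampling code. It assumes a simple Bernoulli sampler in which each partition gets its own `java.util.Random` seeded with `seed + partitionIndex`; the class `SeededSampleSketch` and the helpers `split` and `sample` are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: a toy per-partition Bernoulli sampler, not Spark's implementation.
class SeededSampleSketch {

    // Splits data into numPartitions contiguous slices (assumed slicing scheme).
    static List<List<Integer>> split(List<Integer> data, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        int n = data.size();
        for (int p = 0; p < numPartitions; p++) {
            int start = (int) ((long) p * n / numPartitions);
            int end = (int) ((long) (p + 1) * n / numPartitions);
            parts.add(new ArrayList<>(data.subList(start, end)));
        }
        return parts;
    }

    // Keeps each element with probability `fraction`. Every partition uses its own
    // Random seeded with seed + partition index (an assumption of this sketch).
    static List<Integer> sample(List<List<Integer>> partitions, double fraction, long seed) {
        List<Integer> kept = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            Random rng = new Random(seed + p);
            for (Integer x : partitions.get(p)) {
                // One draw per element, consumed in iteration order.
                if (rng.nextDouble() < fraction) {
                    kept.add(x);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> ints = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            ints.add(i);
        }
        // Same seed, same partitioning: always the same sample.
        System.out.println(sample(split(ints, 2), 0.2, 2L));
        System.out.println(sample(split(ints, 2), 0.2, 2L));
        // Same seed, different partitioning: elements meet different draws.
        System.out.println(sample(split(ints, 4), 0.2, 2L));
    }
}
```

In this sketch the seed fixes the sequence of draws, but which element meets which draw depends on how the input is sliced and in what order it is iterated. With a fixed seed and fixed partitioning the result is fully reproducible; once the partitioning changes, the sample (and possibly its count) is free to change even though the seed did not.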
   
   One way or the other, I don't think we can let the behavior change for a 
seeded sample, but it's not clear whether it's some subtle assumption Spark makes 
or something else going on.
