srowen commented on issue #25404: [SPARK-28683][BUILD] Upgrade Scala to 2.12.10

URL: https://github.com/apache/spark/pull/25404#issuecomment-532278778

Well... that's weird. Lots of tests fail because they get slightly different answers, and almost all of them look like tests that depend on a seeded random number generator at some level. Take https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110781/testReport/test.org.apache.spark/JavaAPISuite/sample/ for example:

```java
@Test
public void sample() {
  List<Integer> ints = IntStream.iterate(1, x -> x + 1)
      .limit(20)
      .boxed()
      .collect(Collectors.toList());
  JavaRDD<Integer> rdd = sc.parallelize(ints);

  // the seeds here are "magic" to make this work out nicely
  JavaRDD<Integer> sample20 = rdd.sample(true, 0.2, 8);
  assertEquals(2, sample20.count());
  JavaRDD<Integer> sample20WithoutReplacement = rdd.sample(false, 0.2, 2);
  assertEquals(4, sample20WithoutReplacement.count());
}
```

It's seeded, so it ought to produce the same answer, and this test has worked the same way for a long time, giving that particular answer with that particular seed. It's pretty straightforwardly using a seeded java.util.Random. What could have changed?

At the moment I'm wondering if somehow a compiler or collections change causes the input to be partitioned differently or iterated over differently. Still thinking about it and reviewing the (pretty minor) Scala changes. One way or the other, I don't think we can let the behavior change for a seeded sample, but it's not clear whether it's some subtle assumption Spark makes or something else going on.
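To make the partitioning suspicion concrete, here's a minimal standalone sketch (this is not Spark's actual sampler; the `seed + p` per-partition seeding and the `split` helper are simplifying assumptions for illustration) showing why a seeded sample is only reproducible as long as both the seed and the way the input is split and iterated stay fixed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SeededSampleSketch {

    // Bernoulli-style sampling over partitioned input. The "seed + p"
    // per-partition seeding is a hypothetical simplification, not Spark's
    // actual scheme; it only illustrates the dependency on partitioning.
    static long sampleCount(List<List<Integer>> partitions, double fraction, long seed) {
        long count = 0;
        for (int p = 0; p < partitions.size(); p++) {
            Random rng = new Random(seed + p);
            for (int ignored : partitions.get(p)) {
                if (rng.nextDouble() < fraction) {
                    count++;
                }
            }
        }
        return count;
    }

    // Split the input into contiguous chunks, one per "partition".
    static List<List<Integer>> split(List<Integer> data, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        for (int i = 0; i < data.size(); i += chunk) {
            parts.add(data.subList(i, Math.min(i + chunk, data.size())));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> ints =
            IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());

        // Same seed and same partitioning: the count is fully deterministic.
        long a = sampleCount(split(ints, 2), 0.2, 8);
        long b = sampleCount(split(ints, 2), 0.2, 8);
        System.out.println(a == b); // true

        // Same seed but a different split: each partition's RNG draws land
        // on different elements, so the sampled count can change even though
        // the seed did not.
        long c = sampleCount(split(ints, 4), 0.2, 8);
        System.out.println("2 partitions: " + a + ", 4 partitions: " + c);
    }
}
```

The point of the sketch is that the "magic" seeds in the test pin down the RNG draws, but not which elements those draws are applied to; anything that changes how the input is partitioned or iterated can change the result without touching the seed.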
