[ 
https://issues.apache.org/jira/browse/SPARK-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12590.
-------------------------------
    Resolution: Not A Problem

Yes, I think you've hit it on the head: the issue is that you're recomputing a 
non-deterministic RDD, so you aren't seeing consistent results. However, the 
non-determinism comes not from the randomSplit call (which is even seeded) but 
from the repartition, which performs a shuffle. This behavior is expected, and 
you do have to cache rdd2 to avoid recomputing it.
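
A minimal sketch of the caching workaround, using the same numbers as the 
report. The object name CachedSplit and the helper split are hypothetical, 
introduced only for illustration, and the weights are written as 0.7/0.3 
instead of 70/30 (randomSplit normalizes them either way). It assumes a 
SparkContext created elsewhere, for example by spark-shell or spark-submit.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration; not part of Spark.
object CachedSplit {
  def split(sc: SparkContext): (RDD[Int], RDD[Int]) = {
    // repartition shuffles, so the order of elements inside each output
    // partition may differ from one evaluation to the next.
    val rdd2 = sc.parallelize(1 to 1000000).repartition(64).cache()

    // Run one action so the cached partitions are materialized once,
    // before either split is evaluated.
    rdd2.count()

    // Both splits now read the same cached partitions instead of
    // triggering their own shuffle.
    val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1L)
    (train, test)
  }
}
```

With rdd2 cached, train.intersection(test).count() would be expected to 
return 0. Note that cache() is best effort: if cached partitions are evicted 
or an executor is lost, Spark recomputes them, and the shuffle order may 
change again.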

> Inconsistent behavior of randomSplit in YARN mode
> -------------------------------------------------
>
>                 Key: SPARK-12590
>                 URL: https://issues.apache.org/jira/browse/SPARK-12590
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core
>    Affects Versions: 1.5.2
>         Environment: YARN mode
>            Reporter: Gaurav Kumar
>
> I noticed inconsistent behavior when using rdd.randomSplit on a 
> repartitioned source RDD, but only in YARN mode. It works fine in local 
> mode.
> *Code:*
> val rdd = sc.parallelize(1 to 1000000)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
> *Master: local*
> Both takeOrdered statements produce consistent results, and the numbers 
> they output do not overlap.
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results 
> every time, and the numbers output for train and test overlap.
> If I call randomSplit on rdd (the original RDD, without the repartition) 
> instead, it works fine even on YARN.
> This suggests that the repartition is re-evaluated every time a split is 
> computed.
> Interestingly, if I cache rdd2 before splitting it, the behavior is 
> consistent, since the repartition is not evaluated again and again.


