[ https://issues.apache.org/jira/browse/SPARK-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075902#comment-15075902 ]
Gaurav Kumar commented on SPARK-12590:
--------------------------------------

Thanks [~srowen] for the explanation.
I think most users, unaware of this behavior, tend to do one of these two things:
1. Cache the source RDD, then call {{randomSplit}} and use train and test going forward. This is not an issue, since the source RDD is cached.
2. Call {{randomSplit}} and then cache train and test separately. This breaks the split, because the uncached source RDD is re-evaluated for each of the two resulting RDDs.

I think there should be a warning in the {{randomSplit}} documentation alerting users to this behavior. It took me quite a while to debug the overlap between the train and test sets.

> Inconsistent behavior of randomSplit in YARN mode
> -------------------------------------------------
>
>                 Key: SPARK-12590
>                 URL: https://issues.apache.org/jira/browse/SPARK-12590
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core
>    Affects Versions: 1.5.2
>         Environment: YARN mode
>            Reporter: Gaurav Kumar
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the source
> rdd is repartitioned, but only in YARN mode. It works fine in local mode
> though.
> *Code:*
> val rdd = sc.parallelize(1 to 1000000)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
> *Master: local*
> Both the take statements produce consistent results and have no overlap in
> numbers being outputted.
> *Master: YARN*
> However, when these are run on YARN mode, these produce random results every
> time and also the train and test have overlap in the numbers being outputted.
> If I use rdd.randomSplit, then it works fine even on YARN.
> So, it concludes that the repartition is being evaluated every time the
> splitting occurs.
> Interestingly, if I cache the rdd2 before splitting it, then we can expect
> consistent behavior since repartition is not evaluated again and again.
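
The two usage patterns described in the comment can be sketched roughly as follows. This is only an illustration, not code from the issue: the object name, the app name, and the `local[*]` master are placeholders, and the overlap check via `intersection` is an addition for demonstration. It assumes the values in the source RDD are distinct, as they are in the reported example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; names and master setting are placeholders.
object RandomSplitCachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("randomSplit-caching-sketch") // placeholder app name
      .setMaster("local[*]")                    // placeholder; the issue was reported on YARN
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000000)

      // Pattern 1: cache the repartitioned source BEFORE splitting, so both
      // splits read the same materialized partitions.
      val source = rdd.repartition(64).cache()
      source.count() // force evaluation so the cache is populated
      val Array(train, test) = source.randomSplit(Array(0.7, 0.3), 1L)

      // Count elements that appear in both splits (values are distinct here).
      val overlap = train.intersection(test).count()
      println(s"overlap with cached source: $overlap")

      // Pattern 2 (the one the comment warns about): split an uncached,
      // repartitioned RDD and cache only the results. Each split re-runs
      // the repartition independently.
      val Array(train2, test2) = rdd.repartition(64).randomSplit(Array(0.7, 0.3), 1L)
      train2.cache()
      test2.cache()
      val overlap2 = train2.intersection(test2).count()
      println(s"overlap with uncached source: $overlap2")
    } finally {
      sc.stop()
    }
  }
}
```

Per the report, the second pattern may show no overlap in local mode, so the difference might only become visible on a cluster. Note also that `cache()` is best-effort: if cached partitions are evicted or an executor is lost, the source is recomputed, so this sketch reduces the risk rather than guaranteeing a stable split.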