[
https://issues.apache.org/jira/browse/SPARK-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075902#comment-15075902
]
Gaurav Kumar commented on SPARK-12590:
--------------------------------------
Thanks [~srowen] for the explanation.
I think most users, unaware of this behavior, tend to do one of two things:
1. Cache the source RDD, then call {{randomSplit}} and use train and test
going forward. This is not an issue, since the source RDD is cached.
2. Call {{randomSplit}} first and then cache train and test separately. This
breaks the split, because the uncached source RDD is re-evaluated for each of
them.
I think the {{randomSplit}} documentation should include a warning about this
behavior. It took me quite a while to debug the overlap between the train and
test sets.
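For anyone hitting the same problem, the first pattern above (cache the source
RDD before splitting) might look roughly like the sketch below. The object and
method names ({{RandomSplitCacheSketch}}, {{overlapAfterCachedSplit}}) and the
{{local[*]}} fallback master are my own illustration, not part of Spark, and
the {{count()}} call is only there to force the cache to be populated before
the split. As far as I understand, caching is a mitigation rather than a
guarantee: if cached partitions are evicted, they would be recomputed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: cache the repartitioned RDD before randomSplit so that
// both splits are computed from the same materialized partitions.
object RandomSplitCacheSketch {

  // Returns how many elements appear in both train and test.
  def overlapAfterCachedSplit(sc: SparkContext): Long = {
    val rdd2 = sc.parallelize(1 to 1000000).repartition(64)

    rdd2.cache()   // mark the repartitioned RDD for caching
    rdd2.count()   // run an action so the cache is actually populated

    val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1L)

    // The source values are unique, so any common value is a real overlap.
    train.intersection(test).count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: fall back to local mode when no master is supplied.
    val conf = new SparkConf()
      .setAppName("randomSplit-cache-sketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"overlap between train and test: ${overlapAfterCachedSplit(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

The same {{intersection}} check can be run without the {{cache()}} and
{{count()}} lines to see whether a given cluster setup reproduces the overlap.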
> Inconsistent behavior of randomSplit in YARN mode
> -------------------------------------------------
>
> Key: SPARK-12590
> URL: https://issues.apache.org/jira/browse/SPARK-12590
> Project: Spark
> Issue Type: Bug
> Components: MLlib, Spark Core
> Affects Versions: 1.5.2
> Environment: YARN mode
> Reporter: Gaurav Kumar
>
> I noticed inconsistent behavior when using rdd.randomSplit on a source rdd
> that has been repartitioned, but only in YARN mode. It works fine in local
> mode.
> *Code:*
> val rdd = sc.parallelize(1 to 1000000)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
> *Master: local*
> Both takeOrdered calls produce consistent results, and the numbers they
> return do not overlap.
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results
> every time, and the train and test outputs overlap.
> If I call randomSplit on rdd directly (without the repartition), it works
> fine even on YARN.
> This suggests that the repartition is re-evaluated every time a split is
> computed.
> Interestingly, if I cache rdd2 before splitting it, then we can expect
> consistent behavior, since the repartition is not evaluated again and again.