GitHub user smartnut007 opened a pull request:
https://github.com/apache/spark/pull/477
SPARK-1438 RDD.sample() make seed param optional
Copying from the previous pull request: https://github.com/apache/spark/pull/462
It's probably better to let the underlying language implementation take care
of the default. This was easier to do with Python, as the default value for
seed in both random and numpy.random is None.
On the Scala/Java side, it might mean propagating an Option or null (oh no!)
down the chain to where the Random is constructed. But it looks like the
convention in some other methods is to use System.nanoTime, so I followed that
convention.
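
A minimal sketch of the two alternatives described above, for comparison. The
object and method names (SeedDefaults, fromOption, fromDefault) are hypothetical
and are not part of Spark; only scala.util.Random and System.nanoTime are real.

```scala
import scala.util.Random

// Hypothetical helper object for illustration; not Spark code.
object SeedDefaults {
  // Alternative 1: carry an Option[Long] all the way to where the
  // Random is constructed, and let Random pick its own seed for None.
  def fromOption(seed: Option[Long]): Random =
    seed.map(s => new Random(s)).getOrElse(new Random())

  // Alternative 2: a default argument. The default expression is
  // re-evaluated on every call that omits the seed.
  def fromDefault(seed: Long = System.nanoTime): Random =
    new Random(seed)
}
```

With an explicit seed both variants build the same generator; they differ only
in how the "no seed given" case is expressed at the call site.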
This conflicts with the overloaded method sql.SchemaRDD.sample, which also
defines default params:
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded alternative of a method to have
default params. I believe the author intended to override the RDD.sample method,
not overload it, so I changed it.
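
To illustrate the restriction, here is a small sketch with hypothetical classes
(Base and Derived are stand-ins, not Spark's RDD or SchemaRDD, and the parameter
order is only an assumption for the example). An override may declare a default
argument, while a second overload with defaults in the same class is rejected
by the compiler.

```scala
// Hypothetical classes for illustration; not Spark's RDD or SchemaRDD.
class Base {
  def sample(withReplacement: Boolean, fraction: Double,
             seed: Long = System.nanoTime): String =
    s"Base(withReplacement=$withReplacement, fraction=$fraction)"

  // Uncommenting the overload below fails to compile, because two
  // overloaded alternatives of `sample` would then define default arguments:
  // def sample(fraction: Double, withReplacement: Boolean = false,
  //            seed: Long = 0L): String = ???
}

class Derived extends Base {
  // Same parameter types as Base.sample, so this is an override,
  // not an overload, and it is allowed to declare a default.
  override def sample(withReplacement: Boolean, fraction: Double,
                      seed: Long = System.nanoTime): String =
    "Derived: " + super.sample(withReplacement, fraction, seed)
}
```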
If backward compatibility is important, three new methods can be introduced
(without default params), like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)
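
The three signatures above could be sketched roughly as follows. LegacySampler
is a hypothetical stand-in over a local collection, not the SchemaRDD change
itself, and the sampling logic is a simplification chosen for the example. The
shorter overloads delegate to the full one, so no default params are needed.

```scala
import scala.util.Random

// Hypothetical stand-in for illustration; not Spark code.
class LegacySampler(items: IndexedSeq[Int]) {
  // No defaults anywhere: each shorter overload fills in one argument.
  def sample(fraction: Double): Seq[Int] =
    sample(fraction, false)

  def sample(fraction: Double, withReplacement: Boolean): Seq[Int] =
    sample(fraction, withReplacement, System.nanoTime)

  def sample(fraction: Double, withReplacement: Boolean, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    if (withReplacement) {
      // Simplified: draw round(size * fraction) elements uniformly at random.
      val n = math.round(items.size * fraction).toInt
      IndexedSeq.fill(n)(items(rng.nextInt(items.size)))
    } else {
      // Simplified Bernoulli sampling: keep each element with probability `fraction`.
      items.filter(_ => rng.nextDouble() < fraction)
    }
  }
}
```

Because none of the overloads declares a default, they can coexist with another
`sample` method that does.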
Added some tests for the Scala RDD takeSample method.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/smartnut007/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/477.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #477
----
commit 0c247dba6084313873b539bcf230371c903f04b3
Author: Arun Ramakrishnan <[email protected]>
Date: 2014-04-21T07:41:09Z
SPARK-1438 RDD language apis to support optional seed in RDD methods
sample/takeSample
commit 69619c6686cc7ff7113f8ef031f3ed3698bafa25
Author: Arun Ramakrishnan <[email protected]>
Date: 2014-04-22T04:37:22Z
SPARK-1438 fix spacing issue
----