GitHub user smartnut007 opened a pull request:
https://github.com/apache/spark/pull/462
SPARK-1438 RDD make seed optional in RDD methods sam...
It's probably better to let the underlying language implementation take care
of the default seed if the user specifies none. This was easier to do in
Python, since the default value for seed in both random and numpy.random is None.
On the Scala/Java side it might mean propagating an Option or null (oh no!) down
the chain to where the Random is constructed. However, the convention in some
other methods appears to be System.nanoTime, so I followed that convention.
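As a rough illustration of the two approaches described above, here is a minimal Python sketch. It is not the code in this pull request: bernoulli_sample and nano_time_seed are hypothetical names, and the sampling logic is a simple stand-in for what RDD.sample does. The first function passes seed=None straight through to random.Random, which then seeds itself; the second imitates the System.nanoTime convention with a time-based fallback.

```python
import random
import time


def bernoulli_sample(items, fraction, seed=None):
    """Keep each item with probability `fraction` (hypothetical helper).

    When seed is None, random.Random picks its own seed, so the caller
    never has to invent a default value.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def nano_time_seed(seed=None):
    """Return the given seed, or a nanosecond timestamp if none was given.

    Assumption: this only mimics the System.nanoTime convention mentioned
    for the Scala/Java side; it is not Spark code.
    """
    return time.time_ns() if seed is None else seed
```

With an explicit seed the result is reproducible; with no seed, each call draws a fresh one.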
Conflict with overloaded method in sql.SchemaRDD
SchemaRDD defines an overloaded method
sample(fraction, withReplacement=false, seed=math.random)
So SchemaRDD had two sample methods with the same parameters in a different
order. I believe the author intended to override the RDD.sample method, not
overload it, so I changed it.
Also, Scala does not allow more than one overloaded alternative of a method to
define default arguments, so this code had to be modified. I'm not sure whether
existing application code might break because of this. If we need to keep things
backward compatible, three new methods can be introduced (without default params),
like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)
Added some tests for the Scala RDD takeSample method. I was able to test the
Java side manually.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/smartnut007/spark branch-1.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/462.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #462
----
commit cb240b3c52149b2afc1195752c3ec0438bb0cd10
Author: Arun Ramakrishnan <[email protected]>
Date: 2014-04-21T07:41:09Z
SPARK-1438 RDD language apis to support optional seed in RDD methods
sample/takeSample
----