GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/19243
[SPARK-21780][R] Simpler Dataset.sample API in R
## What changes were proposed in this pull request?
This PR makes `withReplacement` optional in `sample(...)`, defaulting it to
`FALSE`, consistent with the equivalent Scala / Java / Python APIs.
In short, the following calls are now allowed:
```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, 0.5, 3))
[1] 4
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, withReplacement=TRUE, fraction=0.5, seed=3))
[1] 2
> count(sample(df, 1.0))
[1] 10
> count(sample(df, fraction=1.0))
[1] 10
> count(sample(df, FALSE, fraction=1.0))
[1] 10
> count(sample(df, 1.0, withReplacement=FALSE))
[1] 10
```
In addition, this PR adds type checking for the arguments, as shown below:
```r
> sample(df)
Error in sample(df) :
x (required), withReplacement (optional), fraction (required) and seed
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got
[SparkDataFrame]
> sample(df, "a")
Error in sample(df, "a") :
x (required), withReplacement (optional), fraction (required) and seed
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got
[SparkDataFrame, character]
> sample(df, TRUE, seed="abc")
Error in sample(df, TRUE, seed = "abc") :
x (required), withReplacement (optional), fraction (required) and seed
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got
[SparkDataFrame, logical, character]
> sample(df, -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction
(-1.0) must be on interval [0, 1] without replacement
```
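For readers who want to reason about the calling conventions above without a Spark session, here is a minimal sketch in plain R. It is not the code in this PR: `resolve_sample_args` is a hypothetical helper, and the rule it uses (a numeric first positional argument is treated as `fraction`) is an assumption inferred from the examples, not taken from the patch. It only resolves and type-checks the arguments; it does no sampling and no range check on `fraction`.

```r
# Hypothetical helper for illustration only; not part of SparkR.
# Resolves sample()-style arguments where withReplacement may be omitted.
resolve_sample_args <- function(withReplacement = FALSE, fraction, seed) {
  # Record which arguments were supplied before touching any of them.
  has_fraction <- !missing(fraction)
  has_seed <- !missing(seed)

  # Assumed rule: a numeric first positional argument is really the fraction,
  # so whatever landed in `fraction` positionally is shifted into `seed`.
  if (is.numeric(withReplacement)) {
    if (has_fraction) {
      seed <- fraction
      has_seed <- TRUE
    }
    fraction <- withReplacement
    has_fraction <- TRUE
    withReplacement <- FALSE
  }

  # Type checks: logical flag, required numeric fraction, optional numeric seed.
  ok <- is.logical(withReplacement) &&
    has_fraction && is.numeric(fraction) &&
    (!has_seed || is.numeric(seed))
  if (!ok) {
    stop("withReplacement (optional), fraction (required) and seed (optional) ",
         "should be logical, numeric and numeric")
  }

  list(withReplacement = withReplacement,
       fraction = fraction,
       seed = if (has_seed) seed else NULL)
}

# Usage mirroring the calls in the description (without the data frame).
str(resolve_sample_args(0.5, 3))
str(resolve_sample_args(fraction = 0.5, seed = 3))
str(resolve_sample_args(1.0, withReplacement = FALSE))
```

Under this assumed rule, `resolve_sample_args()`, `resolve_sample_args("a")` and `resolve_sample_args(TRUE, seed = "abc")` all stop with an error, matching the three type-check failures shown above.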
## How was this patch tested?
Tested manually; unit tests were added in
`R/pkg/tests/fulltests/test_sparkSQL.R`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-21780
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19243.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19243
----
commit 680157ef95e5ef4a898e339749d6a8bb2d464991
Author: hyukjinkwon <[email protected]>
Date: 2017-09-15T07:10:09Z
Simpler Dataset.sample API in R
----