GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/22365
[SPARK-25381][SQL] Stratified sampling by Column argument
## What changes were proposed in this pull request?
In this PR, I propose adding an overload of `sampleBy` whose first argument
is of type `Column`. This allows sampling by any complex column as well as
by multiple columns (for example, combined into a struct). For example:
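```Scala
// Assumes a spark-shell session, where `spark` is already in scope.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
import spark.implicits._

spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
    ("Alice", 10))).toDF("name", "age")
  .stat
  // Strata are (name, age) pairs; each Row key maps to its sampling fraction.
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()
// +-----+---+
// | name|age|
// +-----+---+
// | Nico|  8|
// |Alice| 10|
// +-----+---+
```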
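To make the semantics of the example concrete, here is a minimal plain-Scala sketch of stratified sampling over a local `Seq`. It is not the code in this PR and does not use Spark; the object name `StratifiedSampleSketch` and the `key` parameter are hypothetical, introduced only for illustration. It assumes that each row is kept independently with the fraction assigned to its stratum, and that a stratum missing from the map has fraction zero, which is how `sampleBy` is documented to behave.

```scala
import scala.util.Random

// Hypothetical illustration only: a local, non-Spark model of stratified sampling.
object StratifiedSampleSketch {
  // Keeps each row independently with the probability assigned to its stratum.
  // Strata that do not appear in `fractions` are treated as fraction 0.0.
  def sampleBy[T, K](rows: Seq[T], key: T => K,
                     fractions: Map[K, Double], seed: Long): Seq[T] = {
    require(fractions.values.forall(f => f >= 0.0 && f <= 1.0),
      "fractions must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter { row =>
      // nextDouble() is in [0, 1), so 1.0 always keeps and 0.0 always drops.
      rng.nextDouble() < fractions.getOrElse(key(row), 0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))
    // The stratum key is the whole (name, age) pair, mirroring struct($"name", $"age").
    val fractions = Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0)
    val sampled = sampleBy(people, (p: (String, Int)) => p, fractions, 36L)
    sampled.foreach(println)
  }
}
```

In this sketch `("Nico", 8)` is always kept, both `("Bob", 17)` rows are always dropped because that stratum has no entry, and each `("Alice", 10)` row is kept with probability 0.3. The exact rows selected for a given seed will differ from Spark's output, since the random number generators are not the same.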
## How was this patch tested?
Added a new Scala test for sampling by multiple columns, plus Java and
Python tests that check that `sampleBy` accepts a `Column` type
argument.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 sample-by-column
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22365.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22365
----
commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-09-08T13:30:49Z
Adding overloaded sampleBy with Column type
commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-09-08T13:39:30Z
Adding overloaded sampleBy with Column type for Java
commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-09-08T14:56:36Z
Adding overloaded sampleBy with Column type for Python
----