[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

MaxGekk Sat, 08 Sep 2018 09:17:03 -0700

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22365


    [SPARK-25381][SQL] Stratified sampling by Column argument

    ## What changes were proposed in this pull request?
    
    In the PR, I propose to add an overloaded `sampleBy` method whose first
    argument is of the `Column` type. This allows sampling by an arbitrary column
    expression, including complex columns such as structs, and therefore by
    multiple columns at once. For example:
    
    ```Scala
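    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.struct
    import spark.implicits._  // enables toDF and the $"..." column syntax

    spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
      ("Alice", 10))).toDF("name", "age")
      .stat
      // Strata are (name, age) pairs; each maps to its sampling fraction.
      .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
      .show()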
    
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+
    ```
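    
    To make the sampling semantics easier to follow, here is a minimal
    plain-Scala sketch that models stratified sampling over an in-memory
    sequence. It is an illustration only, not the Spark implementation: the
    `Person` case class, the `StratifiedSamplingSketch` object and its
    `sampleBy` helper are hypothetical names, and the sketch assumes that a
    stratum missing from the fractions map is sampled with fraction 0.0 (which
    is consistent with `Bob` being absent from the output above). Its random
    number stream differs from Spark's, so the same seed will not necessarily
    select the same rows.
    
    ```scala
    import scala.util.Random

    // Hypothetical stand-in for a DataFrame row; not part of Spark.
    final case class Person(name: String, age: Int)

    object StratifiedSamplingSketch {
      // Keeps each item with the probability assigned to its stratum.
      // Assumption: strata missing from `fractions` get fraction 0.0.
      def sampleBy[A, K](items: Seq[A], key: A => K,
                         fractions: Map[K, Double], seed: Long): Seq[A] = {
        val rng = new Random(seed)
        items.filter { item =>
          val fraction = fractions.getOrElse(key(item), 0.0)
          // nextDouble() is in [0.0, 1.0): 1.0 always keeps, 0.0 never keeps.
          rng.nextDouble() < fraction
        }
      }

      def main(args: Array[String]): Unit = {
        val people = Seq(Person("Bob", 17), Person("Alice", 10), Person("Nico", 8),
          Person("Bob", 17), Person("Alice", 10))
        // The stratum key is the (name, age) pair, mirroring struct($"name", $"age").
        val fractions = Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0)
        val sampled = sampleBy[Person, (String, Int)](
          people, p => (p.name, p.age), fractions, seed = 36L)
        sampled.foreach(println)
      }
    }
    ```
    
    In this model `Nico` is always kept, `Bob` is always dropped, and each
    `Alice` row is kept independently with probability 0.3.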
    
    ## How was this patch tested?
    
    Added a new Scala test for sampling by multiple columns, plus Java and
    Python tests that check that `sampleBy` accepts a `Column` type argument.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22365
    
----
commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-09-08T13:30:49Z

    Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-09-08T13:39:30Z

    Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-09-08T14:56:36Z

    Adding overloaded sampleBy with Column type for Python

----

