GitHub user oliverpierson opened a pull request:

    https://github.com/apache/spark/pull/11377

    [SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames

    ## What changes were proposed in this pull request?
    
    Fixes a bug in QuantileDiscretizer that produces the wrong number of bins
    for datasets larger than 10K rows, and adds a regression test. This PR
    corrects an issue with PR #11319.
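
The description does not say how the bin count goes wrong. One plausible reading, suggested only by the later commit message "explicitly cast requiredSamples to Double", is that a sampling fraction was computed with integer division and truncated to zero once the row count exceeded the required sample size. The sketch below illustrates that failure mode under that assumption. The object name, both function names, and the numbers are hypothetical and are not taken from the Spark source.

```scala
// Hypothetical sketch: illustrates integer-division truncation in a
// sampling fraction. Not the actual QuantileDiscretizer code.
object SamplingFractionSketch {
  // Assumed buggy form: Long / Long drops the fractional part, so the
  // result is 0.0 whenever rowCount > requiredSamples.
  def truncatedFraction(requiredSamples: Long, rowCount: Long): Double =
    math.min((requiredSamples / rowCount).toDouble, 1.0)

  // Assumed corrected form: convert one operand to Double before dividing
  // so the ratio keeps its fractional part; cap at 1.0 for small datasets.
  def fraction(requiredSamples: Long, rowCount: Long): Double =
    math.min(requiredSamples.toDouble / rowCount, 1.0)

  def main(args: Array[String]): Unit = {
    val requiredSamples = 10000L // illustrative value
    val rowCount = 100000L       // a dataset larger than the sample size
    println(s"truncated: ${truncatedFraction(requiredSamples, rowCount)}")
    println(s"corrected: ${fraction(requiredSamples, rowCount)}")
  }
}
```

If the fraction collapses to zero, a sample drawn with it would be empty, and any splits derived from that sample could not reflect the requested number of bins. This would be consistent with the symptom appearing only on larger datasets.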
    
    
    ## How was this patch tested?
    
    Manual tests, plus running `test-only` on QuantileDiscretizerSuite.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/oliverpierson/spark SPARK-13444

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11377.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11377
    
----
commit 635fb4e78433e8760150d775d41b6af9b3cba976
Author: Oliver Pierson <[email protected]>
Date:   2016-02-23T02:20:48Z

    fixed splits bug in QuantileDiscretizer

commit d5dbaa251f55f7763712d5fb45682a2125a86bb8
Author: Oliver Pierson <[email protected]>
Date:   2016-02-23T17:58:42Z

    explicitly cast requiredSamples to Double

commit 4892fb7004a7690855e5c992667e5d27b0317107
Author: Oliver Pierson <[email protected]>
Date:   2016-02-23T18:23:20Z

    added minSamplesRequired parameter to QuantileDiscretizer

commit 3b55b6023e92ef22a7f7961c4625979d9cc811c4
Author: Oliver Pierson <[email protected]>
Date:   2016-02-23T21:19:38Z

    test for QuantileDiscretizer on large datasets

commit c0052e4bbf6fce19a57b96cdcc342874525dd091
Author: Oliver Pierson <[email protected]>
Date:   2016-02-24T15:14:23Z

    private-tize minSamplesRequired; updated comments

commit abea8765f66d061fca1d5358660fb71e9a194cc2
Author: Oliver Pierson <[email protected]>
Date:   2016-02-24T15:20:18Z

    change QuantileDiscretizer test name to better reflect purpose

commit 4356da2607e9c34a4fce9574fa862305d93c3193
Author: Oliver Pierson <[email protected]>
Date:   2016-02-25T22:05:47Z

    remove setSeed in test suite causing build fail

----


