GitHub user oliverpierson opened a pull request:
https://github.com/apache/spark/pull/11377
[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames
## What changes were proposed in this pull request?
Fixes a bug in QuantileDiscretizer that produces the wrong number of bins
for datasets larger than 10K rows, and adds a regression test. This PR
corrects an issue with PR #11319.
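
For readers following along, here is a minimal sketch of the kind of failure the commit "explicitly cast requiredSamples to Double" points at. It assumes the sampling fraction was computed by dividing an integer sample count by the row count; that is an inference from the commit messages, not a copy of the patched code. The object and method names (`SamplingFractionSketch`, `truncatedFraction`, `castFraction`) are hypothetical and do not appear in Spark.

```scala
object SamplingFractionSketch {
  // Hypothetical helper, not Spark code: divides Int by Long.
  // Integer division truncates, so the quotient is 0 whenever
  // totalRows > requiredSamples, and the sampling fraction collapses to 0.0.
  def truncatedFraction(requiredSamples: Int, totalRows: Long): Double =
    math.min((requiredSamples / totalRows).toDouble, 1.0)

  // Hypothetical helper, not Spark code: casts to Double before dividing,
  // so the quotient keeps its fractional part.
  def castFraction(requiredSamples: Int, totalRows: Long): Double =
    math.min(requiredSamples.toDouble / totalRows, 1.0)

  def main(args: Array[String]): Unit = {
    // 10000 mirrors the "10K rows" threshold mentioned in the description.
    val requiredSamples = 10000
    for (rows <- Seq(5000L, 10000L, 20000L)) {
      val truncated = truncatedFraction(requiredSamples, rows)
      val cast = castFraction(requiredSamples, rows)
      println(s"rows=$rows truncated=$truncated cast=$cast")
    }
  }
}
```

Under these assumptions, both versions agree at or below 10,000 rows (the fraction is capped at 1.0), but at 20,000 rows the truncated version yields 0.0 while the cast version yields 0.5. A zero fraction would sample no rows, which is consistent with bad splits appearing only on larger datasets.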
## How was this patch tested?
Manual tests and `test-only QuantileDiscretizerSuite`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/oliverpierson/spark SPARK-13444
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11377.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11377
----
commit 635fb4e78433e8760150d775d41b6af9b3cba976
Author: Oliver Pierson <[email protected]>
Date: 2016-02-23T02:20:48Z
fixed splits bug in QuantileDiscretizer
commit d5dbaa251f55f7763712d5fb45682a2125a86bb8
Author: Oliver Pierson <[email protected]>
Date: 2016-02-23T17:58:42Z
explicitly cast requiredSamples to Double
commit 4892fb7004a7690855e5c992667e5d27b0317107
Author: Oliver Pierson <[email protected]>
Date: 2016-02-23T18:23:20Z
added minSamplesRequired parameter to QuantileDiscretizer
commit 3b55b6023e92ef22a7f7961c4625979d9cc811c4
Author: Oliver Pierson <[email protected]>
Date: 2016-02-23T21:19:38Z
test for QuantileDiscretizer on large datasets
commit c0052e4bbf6fce19a57b96cdcc342874525dd091
Author: Oliver Pierson <[email protected]>
Date: 2016-02-24T15:14:23Z
private-tize minSamplesRequired; updated comments
commit abea8765f66d061fca1d5358660fb71e9a194cc2
Author: Oliver Pierson <[email protected]>
Date: 2016-02-24T15:20:18Z
change QuantileDiscretizer test name to better reflect purpose
commit 4356da2607e9c34a4fce9574fa862305d93c3193
Author: Oliver Pierson <[email protected]>
Date: 2016-02-25T22:05:47Z
remove setSeed in test suite causing build fail
----