[ https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-13444.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.6.2, 2.0.0
Issue resolved by pull request 11319
[https://github.com/apache/spark/pull/11319]
> QuantileDiscretizer chooses bad splits on large DataFrames
> ----------------------------------------------------------
>
> Key: SPARK-13444
> URL: https://issues.apache.org/jira/browse/SPARK-13444
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.6.0
> Reporter: Oliver Pierson
> Fix For: 2.0.0, 1.6.2
>
>
> In certain circumstances, QuantileDiscretizer fails to calculate the correct
> splits and will instead split data into two bins regardless of the value
> specified in numBuckets.
> For example, suppose dataset.count is 200 million and we run:
>     val discretizer = new QuantileDiscretizer().setNumBuckets(10)
>     // ... set output and input columns ...
>     val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins versus the
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
> fixed by changing line 113 like so:
>     before: val requiredSamples = math.max(numBins * numBins, 10000)
>     after:  val requiredSamples = math.max(numBins * numBins, 10000.0)
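
The report does not quote line 114, so the following is only a sketch of why the one-character change plausibly matters. It assumes that line 114 divides requiredSamples by the row count (a Long) to obtain a sampling fraction. Under that assumption, an Int numerator makes the division an integer division, which truncates to 0 once the dataset has more rows than requiredSamples. The object and method names below are hypothetical and are not taken from QuantileDiscretizer.scala.

```scala
object SamplingFractionSketch {
  // Hypothetical helper mirroring the "before" line. The division step is an
  // assumption about what line 114 does; it is not copied from Spark.
  def fractionBefore(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    // Int / Long is integer division: 10000 / 200000000 == 0
    math.min((requiredSamples / totalRows).toDouble, 1.0)
  }

  // Hypothetical helper mirroring the "after" line.
  def fractionAfter(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    // Double / Long is floating-point division, so the fraction survives
    math.min(requiredSamples / totalRows, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val totalRows = 200000000L // 200 million rows, as in the report
    println(fractionBefore(10, totalRows)) // 0.0
    println(fractionAfter(10, totalRows))  // 5.0E-5
  }
}
```

If the assumption holds, a fraction of 0.0 would leave essentially no sampled rows to compute quantiles from, which would be consistent with the degenerate two-bin result described above.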