[ https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158244#comment-15158244 ]
Apache Spark commented on SPARK-13444:
--------------------------------------
User 'oliverpierson' has created a pull request for this issue:
https://github.com/apache/spark/pull/11319
> QuantileDiscretizer chooses bad splits on large DataFrames
> ----------------------------------------------------------
>
> Key: SPARK-13444
> URL: https://issues.apache.org/jira/browse/SPARK-13444
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.6.0
> Reporter: Oliver Pierson
>
> In certain circumstances, QuantileDiscretizer fails to calculate the
> correct splits and instead splits the data into two bins regardless of the
> value specified in numBuckets.
> For example, suppose dataset.count is 200 million, and we do
> val discretizer = new QuantileDiscretizer().setNumBuckets(10)
> ... set output and input columns ...
> val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins instead of the
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 10000)
> after: val requiredSamples = math.max(numBins * numBins, 10000.0)
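The following is a minimal sketch of why the one-character change could matter. The report does not quote line 114, so this assumes (hypothetically) that line 114 divides requiredSamples by the row count to get a sampling fraction. The object and method names below are illustrative only and are not taken from QuantileDiscretizer.scala.

```scala
object SampleFractionSketch {
  // Mirrors the "before" line: both arguments to math.max are Int,
  // so requiredSamples is an Int.
  def fractionBefore(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    // ASSUMPTION: line 114 divides by the row count. Int / Long is
    // integer division, which truncates to 0 whenever
    // rowCount > requiredSamples.
    math.min((requiredSamples / rowCount).toDouble, 1.0)
  }

  // Mirrors the "after" line: the literal 10000.0 makes math.max
  // return a Double, so the division below is floating point.
  def fractionAfter(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / rowCount, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rowCount = 200000000L // 200 million rows, as in the report
    println(fractionBefore(10, rowCount)) // prints 0.0
    println(fractionAfter(10, rowCount))  // prints 5.0E-5
  }
}
```

Under that assumption, a fraction of 0.0 would leave the sample empty, which is consistent with the collapse to two bins described in the report.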