Oliver Pierson created SPARK-13444:
--------------------------------------
Summary: QuantileDiscretizer chooses bad splits on large DataFrames
Key: SPARK-13444
URL: https://issues.apache.org/jira/browse/SPARK-13444
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.6.0
Reporter: Oliver Pierson
In certain circumstances, QuantileDiscretizer fails to calculate the
correct splits and instead splits the data into two bins regardless of the value
specified in numBuckets.
For example, suppose dataset.count is 200 million and we run:
val discretizer = new QuantileDiscretizer().setNumBuckets(10)
... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)
In this case, dataWithBins will have only two distinct bins instead of the
expected 10.
The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed
by changing line 113 like so:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
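The one-character change suggests the failure comes from integer arithmetic.
The following is a minimal sketch of that effect, not the Spark source. It
assumes that line 114 (not quoted in this report) divides requiredSamples by the
Long row count to get a sampling fraction. The names SampleFractionSketch,
fractionBefore, fractionAfter and totalRows are hypothetical and used only for
illustration.

```scala
object SampleFractionSketch {
  // Hypothetical stand-in for the behaviour before the fix.
  def fractionBefore(numBins: Int, totalRows: Long): Double = {
    // math.max(Int, Int) yields an Int.
    val requiredSamples = math.max(numBins * numBins, 10000)
    // ASSUMPTION: the next line of the real code divides by the row count.
    // Int / Long is integer division, so the quotient truncates to 0
    // whenever totalRows is larger than requiredSamples.
    math.min((requiredSamples / totalRows).toDouble, 1.0)
  }

  // Hypothetical stand-in for the behaviour after the fix.
  def fractionAfter(numBins: Int, totalRows: Long): Double = {
    // The 10000.0 literal makes requiredSamples a Double.
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    // Double / Long is floating-point division, so the fraction survives.
    math.min(requiredSamples / totalRows, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rows = 200000000L // 200 million rows, as in the example above
    println(fractionBefore(10, rows)) // 0.0: nothing would be sampled
    println(fractionAfter(10, rows))  // small positive fraction (10000 / 200 million)
  }
}
```

Under that assumption, a sampling fraction of 0.0 would leave the discretizer
with no sampled values from which to compute quantiles, which is consistent with
the degenerate splits described above.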
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)