Oliver Pierson created SPARK-13444:
--------------------------------------

             Summary: QuantileDiscretizer chooses bad splits on large DataFrames
                 Key: SPARK-13444
                 URL: https://issues.apache.org/jira/browse/SPARK-13444
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.0
            Reporter: Oliver Pierson


In certain circumstances, QuantileDiscretizer fails to calculate the
correct splits and instead splits the data into two bins, regardless of the
value specified in numBuckets.

For example, suppose dataset.count is 200 million and we do

val discretizer = new QuantileDiscretizer().setNumBuckets(10)
  ... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)

In this case, dataWithBins will have only two distinct bins instead of the
expected 10.

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
fixed by changing line 113 like so:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
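
A minimal sketch of why the one-character change matters. It assumes that
line 114, which is not quoted in this report, divides requiredSamples by the
row count to get a sampling fraction. The object and method names below are
hypothetical and exist only for illustration; they are not the Spark code.

```scala
object SampleFractionSketch {
  // Mirrors the "before" line: math.max(Int, Int) returns an Int.
  def fractionBefore(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    // Assumed shape of line 114: Int / Long is integer division, so the
    // quotient truncates to 0 whenever totalSamples > requiredSamples.
    math.min(requiredSamples / totalSamples, 1.0)
  }

  // Mirrors the "after" line: the 10000.0 literal makes requiredSamples a Double.
  def fractionAfter(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    // Double / Long is floating-point division, so the fraction survives.
    math.min(requiredSamples / totalSamples, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val totalSamples = 200000000L // 200 million rows, as in the example above
    println(fractionBefore(10, totalSamples)) // prints 0.0
    println(fractionAfter(10, totalSamples))  // prints 5.0E-5
  }
}
```

Under that assumption, a fraction of 0.0 would mean no rows are sampled, which
would be consistent with the degenerate splits described above.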





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

