[ 
https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158235#comment-15158235
 ] 

Oliver Pierson commented on SPARK-13444:
----------------------------------------

I've made the changes in my local fork and can put together a PR tomorrow.


> QuantileDiscretizer chooses bad splits on large DataFrames
> ----------------------------------------------------------
>
>                 Key: SPARK-13444
>                 URL: https://issues.apache.org/jira/browse/SPARK-13444
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>            Reporter: Oliver Pierson
>
> In certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits and instead splits the data into two bins regardless of the 
> value specified in numBuckets.
> For example, suppose dataset.count is 200 million and we run:
> val discretizer = new QuantileDiscretizer().setNumBuckets(10)
>   ... set output and input columns ...
> val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins versus the 
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be 
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 10000)
> after: val requiredSamples = math.max(numBins * numBins, 10000.0)
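
The report does not quote line 114, so the following is only a sketch of why the one-character change would matter. It assumes line 114 divides requiredSamples by dataset.count (a Long) to obtain a sampling fraction capped at 1.0. The object and method names below are hypothetical and are not part of Spark.

```scala
// Minimal sketch, assuming line 114 computes something like
//   math.min(requiredSamples / dataset.count, 1.0)
// The names here are illustrative only and do not come from Spark.
object SamplingFractionSketch {

  // "before": requiredSamples is an Int, so Int / Long is integer division.
  // Whenever count exceeds requiredSamples the quotient truncates to 0.
  def fractionBefore(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    math.min(requiredSamples / count, 1.0)
  }

  // "after": the 10000.0 literal selects math.max(Double, Double), so
  // requiredSamples is a Double and the division is floating point.
  def fractionAfter(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val count = 200000000L // 200 million rows, as in the report
    println(fractionBefore(10, count)) // prints 0.0
    println(fractionAfter(10, count))  // a small positive fraction, about 5e-5
  }
}
```

Under that assumption, a fraction of 0.0 means no rows are sampled, which would be consistent with the discretizer falling back to only two bins on large inputs.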


