[jira] [Updated] (SPARK-13444) QuantileDiscretizer chooses bad splits on large DataFrames

Oliver Pierson (JIRA) Mon, 22 Feb 2016 20:12:32 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Oliver Pierson updated SPARK-13444:
-----------------------------------
    Description: 
In certain circumstances, QuantileDiscretizer fails to calculate the correct 
splits and will instead split data into two bins regardless of the value 
specified in numBuckets.

For example, suppose dataset.count is 200 million and we run:

val discretizer = new QuantileDiscretizer().setNumBuckets(10)
  ... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)

In this case, dataWithBins will have only two distinct bins versus the expected 
10.

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be 
fixed by changing line 113 as follows:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
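
To illustrate why the one-character change matters, here is a minimal standalone sketch with no Spark dependency. Line 114 is not quoted in this report, so the sketch assumes it divides requiredSamples by the row count and clamps the result to 1.0; the object and method names (SampleFractionSketch, fractionBefore, fractionAfter) are hypothetical and exist only for this illustration.

```scala
// Standalone illustration of the suspected integer-division problem.
// Assumption: line 114 computes roughly
//   math.min(requiredSamples / dataset.count(), 1.0)
// The names below are hypothetical and are not part of Spark.
object SampleFractionSketch {

  // "before": requiredSamples is an Int, so Int / Long is integer division.
  def fractionBefore(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    // The quotient is a Long; it is 0 whenever totalRows > requiredSamples.
    math.min(requiredSamples / totalRows, 1.0)
  }

  // "after": the 10000.0 literal makes requiredSamples a Double,
  // so the division is floating-point.
  def fractionAfter(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / totalRows, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rows = 200000000L // 200 million rows, as in the example above
    println(s"before: ${fractionBefore(10, rows)}")
    println(s"after:  ${fractionAfter(10, rows)}")
  }
}
```

Under that assumption, the "before" version yields a sampling fraction of 0.0 for 200 million rows, while the "after" version yields about 5e-5. A fraction of zero would leave the sample empty, which is consistent with the degenerate splits described above. With small inputs the integer quotient is at least 1 and is clamped to 1.0, which may explain why the problem only shows up on large DataFrames.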



  was:
In certain circumstances, QuantileDiscretizer fails to calculate the correct 
splits and will instead split data into two bin regardless of the value 
specified in numBuckets.

For example, supposed dataset.count is 200 million.  And we do

val discretizer = new QuantileDiscretizer().setNumBuckets(10)
  ... set output and input columns ...
val dataWithBins = discretizer.fit(dataset).transform(dataset)

In this case, dataWithBins will have only two distinct bins versus the expected 
10.

Problem is in line 113 and 114 of QuantileDiscretizer.scala and can be fixed by 
changing line 113 like so:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)




> QuantileDiscretizer chooses bad splits on large DataFrames
> ----------------------------------------------------------
>
>                 Key: SPARK-13444
>                 URL: https://issues.apache.org/jira/browse/SPARK-13444
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>            Reporter: Oliver Pierson
>
> In certain circumstances, QuantileDiscretizer fails to calculate the correct 
> splits and will instead split data into two bins regardless of the value 
> specified in numBuckets.
> For example, suppose dataset.count is 200 million and we run:
> val discretizer = new QuantileDiscretizer().setNumBuckets(10)
>   ... set output and input columns ...
> val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins versus the 
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be 
> fixed by changing line 113 as follows:
> before: val requiredSamples = math.max(numBins * numBins, 10000)
> after: val requiredSamples = math.max(numBins * numBins, 10000.0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
