Good catch, though it's probably very slightly simpler to write:

math.min(requiredSamples.toDouble ...

If JIRA won't let you create the issue, first check that you're logged in.
If you still have trouble, I'll open it for you. You can file it as a minor
bug against ML.

This page explains how to open a PR and everything else:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Tue, Feb 23, 2016 at 2:45 AM, Pierson, Oliver C <o...@gatech.edu> wrote:
> Hello,
>
>   I've discovered a bug in the QuantileDiscretizer estimator.  Specifically,
> for large DataFrames QuantileDiscretizer will only create one split (i.e.
> two bins).
>
>
> The error happens in lines 113 and 114 of QuantileDiscretizer.scala:
>
>
>     val requiredSamples = math.max(numBins * numBins, 10000)
>
>     val fraction = math.min(requiredSamples / dataset.count(), 1.0)
>
>
> After the first line, requiredSamples is an Int.  Therefore, if
> requiredSamples < dataset.count() then the integer division truncates to 0
> and fraction is always 0.0.
>
>
> The problem can be fixed simply by replacing the first line with:
>
>
>   val requiredSamples = math.max(numBins * numBins, 10000.0)
>
>
> I've implemented this change in my fork and all tests passed (except for
> docker integration, but I think that's another issue).  I'm happy to submit
> a PR if it will ease someone else's workload.  However, I'm unsure of how to
> create a JIRA.  I've created an account on the issue tracker
> (issues.apache.org) but when I try to create an issue it asks me to choose a
> "Service Desk".  Which one should I be choosing?
>
>
> Thanks much,
>
> Oliver Pierson
>
>
>
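
The truncation described in the quoted message can be reproduced without Spark. The following is a minimal sketch, not the actual QuantileDiscretizer code: the object name FractionDemo, the value numBins = 10, and the row count of 1,000,000 (standing in for dataset.count(), which returns a Long) are all assumptions made for illustration. The last variant is only one plausible reading of the truncated suggestion in the reply above.

```scala
object FractionDemo {
  // Hypothetical stand-ins for the values used in QuantileDiscretizer.scala.
  val numBins: Int = 10
  val count: Long = 1000000L // plays the role of dataset.count()

  // Original form: Int / Long is integer division, so the quotient is 0
  // whenever requiredSamples is smaller than the row count.
  val requiredSamples: Int = math.max(numBins * numBins, 10000)
  val truncated: Double = math.min(requiredSamples / count, 1.0)

  // Fix from the quoted message: make requiredSamples a Double up front.
  val requiredSamplesAsDouble: Double = math.max(numBins * numBins, 10000.0)
  val fixedUpFront: Double = math.min(requiredSamplesAsDouble / count, 1.0)

  // Assumed reading of the reply: convert to Double at the point of division.
  val fixedAtDivision: Double = math.min(requiredSamples.toDouble / count, 1.0)

  def main(args: Array[String]): Unit = {
    println(s"truncated       = $truncated")
    println(s"fixedUpFront    = $fixedUpFront")
    println(s"fixedAtDivision = $fixedAtDivision")
  }
}
```

Both fixes give the same fraction here. Converting at the division keeps requiredSamples an Int for any other code that uses it.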
