Hello,
I've discovered a bug in the QuantileDiscretizer estimator. Specifically,
for large DataFrames QuantileDiscretizer will only create one split (i.e. two
bins).
The error happens in lines 113 and 114 of QuantileDiscretizer.scala:
val requiredSamples = math.max(numBins * numBins, 10000)
val fraction = math.min(requiredSamples / dataset.count(), 1.0)
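After the first line, requiredSamples is an Int, so requiredSamples /
dataset.count() is an integer (Int / Long) division. Therefore, if
requiredSamples < dataset.count(), the quotient truncates to 0 and fraction is
always 0.0.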
The problem can be fixed simply by replacing the first line with:
val requiredSamples = math.max(numBins * numBins, 10000.0)
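To illustrate, here is a minimal standalone sketch of the two variants. It does
not use Spark: the parameter `count` stands in for dataset.count(), and the
object and method names (FractionDemo, buggyFraction, fixedFraction) are made
up for this example rather than taken from QuantileDiscretizer.scala.

```scala
object FractionDemo {
  // Mirrors the current code: requiredSamples is an Int, so the division
  // below is integer (Int / Long) division and truncates toward zero.
  def buggyFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    val ratio: Long = requiredSamples / count                // 0 when count > requiredSamples
    math.min(ratio.toDouble, 1.0)
  }

  // Mirrors the proposed fix: the literal 10000.0 makes requiredSamples a
  // Double, so the division is floating-point.
  def fixedFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val count = 1000000L // hypothetical row count, larger than requiredSamples
    println(buggyFraction(10, count)) // prints 0.0
    println(fixedFraction(10, count)) // prints 0.01
  }
}
```

With numBins = 10 and one million rows, the first variant yields a sampling
fraction of 0.0, while the second yields 0.01.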
I've implemented this change in my fork and all tests passed (except for docker
integration, but I think that's another issue). I'm happy to submit a PR if it
will ease someone else's workload. However, I'm unsure of how to create a
JIRA. I've created an account on the issue tracker (issues.apache.org) but
when I try to create an issue it asks me to choose a "Service Desk". Which one
should I be choosing?
Thanks much,
Oliver Pierson