GitHub user oliverpierson commented on the pull request:
https://github.com/apache/spark/pull/11402#issuecomment-190372118
After running the test on my machine again, I discovered that it randomly
passes/fails. It appears that the problem is in
[`findSplitCandidates`](https://github.com/oliverpierson/spark/blob/SPARK-13444/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L123).
This method can return `n+1` buckets under certain circumstances when only `n`
buckets are desired. The new test passes or fails at random because it relies
on random sampling of the data to estimate the quantiles.
However, the method can still fail deterministically. For example,
consider the following:
```
val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer()
  .setInputCol("x")
  .setOutputCol("y")
  .setNumBuckets(5)
discretizer.fit(df).getSplits
```
This gives the following splits:
```
Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
```
which corresponds to six buckets.
There are a few ways to fix `findSplitCandidates`. The most
straightforward (albeit less elegant) way is to track the number of splits
found so far inside the `while` loop and keep iterating only while
`(index < valueCounts.length && splitsSoFar < numSplits)` holds, so the loop
stops as soon as `numSplits` splits have been collected. I believe this is
probably the best option for the bug in `branch-1.6`. If there are no
objections, I can put a commit together.
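To make the proposed loop condition concrete, here is a minimal standalone sketch. It is a simplified stand-in and not the Spark source: only `index`, `valueCounts`, `splitsSoFar` and `numSplits` come from the description above, while the object name, the stride-based target counting and the sentinel entry are assumptions made for illustration.

```scala
import scala.collection.mutable

object CappedSplits {
  // Hypothetical stand-in for the split search; NOT the actual Spark code.
  def findSplitsCapped(samples: Array[Double], numSplits: Int): Array[Double] = {
    // Distinct values with their counts, sorted by value, plus a sentinel entry.
    val counts = samples.groupBy(identity).map { case (v, vs) => (v, vs.length) }.toArray
    val valueCounts = counts.sortWith(_._1 < _._1) :+ ((Double.MaxValue, 1))

    if (valueCounts.length - 1 <= numSplits) {
      // Few enough distinct values: every distinct value becomes a split.
      valueCounts.dropRight(1).map(_._1)
    } else {
      // Assumed target spacing: roughly equal sample counts per bucket.
      val stride = math.ceil(samples.length.toDouble / (numSplits + 1))
      val splits = mutable.ArrayBuffer.empty[Double]
      var index = 1
      var splitsSoFar = 0
      var currentCount = valueCounts(0)._2
      var targetCount = stride
      // The extra `splitsSoFar < numSplits` check is the proposed fix:
      // the loop stops once the requested number of splits is reached.
      while (index < valueCounts.length && splitsSoFar < numSplits) {
        val previousCount = currentCount
        currentCount += valueCounts(index)._2
        // The previous value is a split if its running count is closer to the target.
        if (math.abs(previousCount - targetCount) < math.abs(currentCount - targetCount)) {
          splits += valueCounts(index - 1)._1
          splitsSoFar += 1
          targetCount += stride
        }
        index += 1
      }
      splits.toArray
    }
  }
}
```

With the values 1.0 to 10.0 and `numSplits = 4` (five buckets), this sketch stops after `2.0, 4.0, 6.0, 8.0` instead of also emitting `10.0`.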
As for the `master` branch, I'm considering rewriting the
`findSplitCandidates` method using [the usual method for finding
quantiles](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample).
It's done this way in NumPy/SciPy, and I believe it would be at least as fast
as the current routine. Does anybody have objections or concerns about such a
rewrite?
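As a rough illustration of what such a rewrite could look like, the sketch below computes the interior splits with one common estimator from that family: linear interpolation between order statistics, which is NumPy's default for percentiles. The object and method names are hypothetical, and this is only a sketch under those assumptions, not a proposed patch.

```scala
object QuantileSplits {
  // Linear-interpolation quantile of a sorted, non-empty array, for 0 <= p <= 1.
  def quantile(sorted: Array[Double], p: Double): Double = {
    val pos = p * (sorted.length - 1)     // fractional index into the sorted data
    val lower = math.floor(pos).toInt
    val upper = math.ceil(pos).toInt
    sorted(lower) + (pos - lower) * (sorted(upper) - sorted(lower))
  }

  // Interior split points for `numBuckets` buckets: the i/numBuckets quantiles.
  // Duplicates are dropped, so heavily repeated data may yield fewer splits.
  def splits(samples: Array[Double], numBuckets: Int): Array[Double] = {
    val sorted = samples.clone()
    java.util.Arrays.sort(sorted)
    (1 until numBuckets)
      .map(i => quantile(sorted, i.toDouble / numBuckets))
      .distinct
      .toArray
  }
}
```

Because the number of quantiles requested is exactly `numBuckets - 1`, this approach can never return more splits than asked for; wrapping the result with `-Infinity` and `Infinity` then gives at most `numBuckets` buckets.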