Barry Becker created SPARK-21986:
------------------------------------
Summary: QuantileDiscretizer picks wrong split point for data with
lots of 0's
Key: SPARK-21986
URL: https://issues.apache.org/jira/browse/SPARK-21986
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.1.1
Reporter: Barry Becker
I have some simple test cases to help illustrate (see below).
I discovered this with data that had 96,000 rows, but can reproduce with much
smaller data that has roughly the same distribution of values.
If I have data like
Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
and ask for 3 buckets, then it does the right thing and yields splits of
Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
However, if I add just one more zero, such that I have data like
Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
then it will do the wrong thing and give splits of
Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
I'm not bothered that it gave fewer buckets than asked for (that is to be
expected), but I am bothered that it picked 0.0 instead of 40 as the one split
point.
With 0.0 as the only split point, I end up with one bucket holding all of the
data and a second bucket holding none of it.
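To make the effect of each split choice concrete, here is a minimal sketch in plain Scala (no Spark). It assumes Bucketizer-style semantics, where each bucket covers [lower, upper) and the last bucket also includes its upper bound. The object `BucketCounts` and the helper `bucketCounts` are hypothetical names for illustration, not part of the Spark API.

```scala
object BucketCounts {
  // Hypothetical helper (not a Spark API): count how many values land in each
  // bucket, assuming buckets are [lower, upper) and the last bucket also
  // includes its upper bound.
  def bucketCounts(data: Seq[Double], splits: Seq[Double]): Seq[Int] = {
    val lastIdx = splits.length - 2
    splits.sliding(2).toList.zipWithIndex.map { case (Seq(lo, hi), i) =>
      data.count(x => x >= lo && (x < hi || (i == lastIdx && x == hi)))
    }
  }

  // The 12-value data set from this report (nine 0s plus 40, 45, 46).
  val data: Seq[Double] =
    Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)

  def main(args: Array[String]): Unit = {
    val inf = Double.PositiveInfinity
    // Split at 0.0 (what I got): the first bucket is empty.
    println(bucketCounts(data, Seq(-inf, 0.0, inf)).mkString(", "))  // 0, 12
    // Split at 40.0 (what I expected): the zeros are separated from the rest.
    println(bucketCounts(data, Seq(-inf, 40.0, inf)).mkString(", ")) // 9, 3
  }
}
```

Under these assumptions, a single split at 0.0 separates nothing, while a single split at 40.0 puts the nine zeros in one bucket and the three larger values in the other.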
Am I interpreting something wrong?
Here are my two test cases in Scala:
{code}
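import org.apache.spark.ml.feature.QuantileDiscretizer
import org.scalatest.FunSuite

// SPARK_SESSION is assumed to be a SparkSession defined elsewhere in the test setup.
class QuantileDiscretizerSuite extends FunSuite {

  // Documents the observed (unexpected) result: 0.0 is the only split point.
  test("Quantile discretizer on data with lots of 0") {
    verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  }

  // With one fewer 0, the splits are what I would expect.
  test("Quantile discretizer on data with one less 0") {
    verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
  }

  def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
    // Pair each value with a dummy second column so the rows are tuples.
    val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
    val df = SPARK_SESSION.sqlContext.createDataFrame(theData)
      .toDF("rawCol", "unused")
    val qb = new QuantileDiscretizer()
      .setInputCol("rawCol")
      .setOutputCol("binnedColumn")
      .setRelativeError(0.0) // request exact quantiles
      .setNumBuckets(3)
      .fit(df)
    // getSplits returns an Array; compare as a Seq to make the equality explicit.
    assertResult(expectedSplits) { qb.getSplits.toSeq }
  }
}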
{code}