GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/16661
[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?
The following test fails on current master:
````scala
test("gmm fails on high dimensional data") {
  val ctx = spark.sqlContext
  import ctx.implicits._
  // Two sparse vectors with one more feature than the allowed maximum.
  val df = Seq(
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4),
      Array(3.0, 8.0)),
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5),
      Array(4.0, 9.0)))
    .map(Tuple1.apply).toDF("features")
  val gm = new GaussianMixture()
  intercept[IllegalArgumentException] {
    gm.fit(df)
  }
}
````
Instead, you'll get an `ArrayIndexOutOfBoundsException`, or something similar
for MLlib. That's because the covariance matrix allocates an array of size
`numFeatures * numFeatures`, and in this case the product overflows the integer
range. While there is currently a warning that the algorithm does not perform
well for a high number of features, we should perform an explicit check to
communicate this limitation to users.
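To make the overflow concrete, here is a minimal standalone sketch of the arithmetic only. It is not taken from the Spark sources; `OverflowSketch`, `covSizeInt`, and `covSizeLong` are hypothetical names used for illustration.

```scala
object OverflowSketch {
  // Entry count of a dense numFeatures x numFeatures covariance matrix,
  // computed in 32-bit Int arithmetic, which wraps silently on overflow.
  def covSizeInt(numFeatures: Int): Int = numFeatures * numFeatures

  // The same quantity computed in 64-bit arithmetic, for comparison.
  def covSizeLong(numFeatures: Int): Long = numFeatures.toLong * numFeatures

  def main(args: Array[String]): Unit = {
    val ok = 46340  // largest n for which n * n still fits in an Int
    val bad = 46341 // one more feature and the product wraps around
    println(s"$ok -> ${covSizeInt(ok)}")
    println(s"$bad -> ${covSizeInt(bad)} (true size: ${covSizeLong(bad)})")
  }
}
```

Depending on how far the product wraps, the computed size may be negative or a too-small positive number, which would explain why the resulting exception can differ from case to case.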
This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)`
check to the ML and MLlib algorithms. As for the limit itself, we could set it
just low enough to avoid integer overflow, something like
`math.sqrt(Int.MaxValue).toInt` (about 46k), which eliminates the cryptic
error. However, in WLS for example, we need to collect an array on the order of
`numFeatures * numFeatures` to the driver, and we therefore limit it to 4096
features. We may want to keep that convention here for consistency.
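The two candidate limits and the shape of the check can be sketched as follows. This is an illustration under assumptions, not the code in the patch: `FeatureLimitSketch`, `overflowSafeLimit`, `driverFriendlyLimit`, and `checkNumFeatures` are hypothetical names, and the limit is passed as a parameter because the final value of `MAX_NUM_FEATURES` is still under discussion.

```scala
object FeatureLimitSketch {
  // Overflow-safe bound: the largest n such that n * n fits in an Int.
  val overflowSafeLimit: Int = math.sqrt(Int.MaxValue).toInt

  // The more conservative bound mentioned above for WLS.
  val driverFriendlyLimit: Int = 4096

  // Fails fast with an IllegalArgumentException (thrown by require)
  // instead of a cryptic error from the array allocation.
  def checkNumFeatures(numFeatures: Int, maxNumFeatures: Int): Unit = {
    require(numFeatures < maxNumFeatures,
      s"Number of features must be less than $maxNumFeatures " +
        s"but got $numFeatures.")
  }
}
```

With either limit, a call such as `FeatureLimitSketch.checkNumFeatures(50000, FeatureLimitSketch.overflowSafeLimit)` throws an `IllegalArgumentException`, which is the exception type the test above intercepts.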
## How was this patch tested?
Unit tests in ML and MLlib.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark gmm_high_dim
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16661.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16661
----
commit b5ae5bde519c80a9584a8e6429a54b2474b9c3ac
Author: sethah <[email protected]>
Date: 2017-01-20T17:13:39Z
numFeatures check
----