GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/16661

    [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features

    ## What changes were proposed in this pull request?
    
    The following test will fail on the current master:
    
    ````scala
    test("gmm fails on high dimensional data") {
        val ctx = spark.sqlContext
        import ctx.implicits._
        val df = Seq(
          Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
          Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
          .map(Tuple1.apply).toDF("features")
        val gm = new GaussianMixture()
        intercept[IllegalArgumentException] {
          gm.fit(df)
        }
      }
    ````
    
    Instead, you'll get an `ArrayIndexOutOfBoundsException` (or something similar in MLlib). That's because the covariance matrix allocates an array of size `numFeatures * numFeatures`, and in this case that product overflows `Int`. While there is currently a warning that the algorithm does not perform well for a high number of features, we should add an explicit check that communicates this limitation to users.
    
    This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the limit itself, we could choose a value that merely avoids numerical overflow, such as `math.sqrt(Integer.MaxValue).toInt` (about 46k), which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit WLS to 4096 features. We may want to keep that convention here for consistency.
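    
    As a sketch of the overflow this guards against (the object, method names, and error message here are illustrative, not the exact patch code):
    
    ````scala
    object GmmFeatureLimit {
      // Largest numFeatures whose square still fits in an Int:
      // math.sqrt(Int.MaxValue).toInt == 46340, since 46341 * 46341 > Int.MaxValue.
      val MaxNumFeatures: Int = math.sqrt(Int.MaxValue).toInt

      // Hypothetical version of the proposed require() check.
      def checkNumFeatures(numFeatures: Int): Unit = {
        require(numFeatures <= MaxNumFeatures,
          s"GaussianMixture cannot handle more than $MaxNumFeatures features " +
          s"because the covariance matrix requires numFeatures^2 entries; got $numFeatures.")
      }

      def main(args: Array[String]): Unit = {
        checkNumFeatures(100)  // passes
        val over = MaxNumFeatures + 1
        // Widening to Long shows the Int product would have overflowed.
        println(over.toLong * over > Int.MaxValue.toLong)  // true
        checkNumFeatures(over) // throws IllegalArgumentException with a clear message
      }
    }
    ````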
    
    ## How was this patch tested?
    Unit tests in ML and MLlib.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark gmm_high_dim

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16661
    
----
commit b5ae5bde519c80a9584a8e6429a54b2474b9c3ac
Author: sethah <[email protected]>
Date:   2017-01-20T17:13:39Z

    numFeatures check

----


