GitHub user keypointt opened a pull request:

    https://github.com/apache/spark/pull/13584

    [SPARK-15509][ML][SparkR] R MLlib algorithms should support input columns 
"features" and "label"

    https://issues.apache.org/jira/browse/SPARK-15509
    
    ## What changes were proposed in this pull request?
    
    Currently in SparkR, when you load a LibSVM dataset using the sqlContext 
and then pass it to an MLlib algorithm, the ML wrappers will fail since they 
will try to create a "features" column, which conflicts with the existing 
"features" column from the LibSVM loader. E.g., using the "mnist" dataset from 
LibSVM:
    `training <- loadDF(sqlContext, ".../mnist", "libsvm")`
    `model <- naiveBayes(label ~ features, training)`
    This fails with:
    ```
    16/05/24 11:52:41 ERROR RBackendHandler: fit on 
org.apache.spark.ml.r.NaiveBayesWrapper failed
    Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
      java.lang.IllegalArgumentException: Output column features already exists.
        at 
org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
        at 
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
        at 
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
        at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
        at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
        at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
        at 
org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
        at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
    ```
    The same issue appears for the "label" column once you rename the "features" column.
    The cause is that when `loadDF()` is used to create the DataFrame, the loaded 
data may already come with the default column names `"label"` and `"features"` 
(as the LibSVM source does), and these names conflict with the defaults 
`setDefault(labelCol, "label")` and `setDefault(featuresCol, "features")` in 
`SharedParams.scala`.
    
    
    
    
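    The commits below introduce a small util function so that each R wrapper 
performs this column check before fitting. A minimal sketch of the idea in 
Scala (the object and method names are illustrative assumptions, not the exact 
code in this PR):
    ```
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.Dataset

    object RWrapperUtilsSketch {
      // If the input data already has a column with the default output name
      // ("features" or "label"), return a random non-conflicting name instead.
      def nonConflictingName(data: Dataset[_], default: String): String = {
        if (data.schema.fieldNames.contains(default)) {
          Identifiable.randomUID(default) // e.g. "features" -> "features_<random suffix>"
        } else {
          default
        }
      }
    }
    ```
    A wrapper's `fit` could then set the returned name on the `RFormula`'s 
`featuresCol`/`labelCol` params before fitting, so the assembled output column 
never collides with one already loaded from the LibSVM source.
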
    ## How was this patch tested?
    
    Tested on my local machine; a unit test covering the column-name check is included.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keypointt/spark SPARK-15509

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13584
    
----
commit cfed8844cbadbd760f73c2f906a1591806001a93
Author: Xin Ren <[email protected]>
Date:   2016-06-08T23:39:28Z

    [SPARK-15509] remove duplicate of intercept[IllegalArgumentException]

commit 77886fe59463027f24c6ca909638731145b46ee2
Author: Xin Ren <[email protected]>
Date:   2016-06-09T20:59:38Z

    [SPARK-15509] no column exists error for naivebayes. expand to other 
wrappers

commit e112ac0c0685f399f72e9ed60be00964ec4fcdc4
Author: Xin Ren <[email protected]>
Date:   2016-06-09T21:04:56Z

    [SPARK-15509] add a util function for all wrappers

commit ef3702ee5beefad1ee51fe15cb01e1716aeda362
Author: Xin Ren <[email protected]>
Date:   2016-06-09T22:27:37Z

    [SPARK-15509] expand column check to other wrappers

commit aab3a12fe09cf3039708468a80837fa421739c69
Author: Xin Ren <[email protected]>
Date:   2016-06-09T23:05:51Z

    [SPARK-15509] add unit test

commit f68ac34907f3a7d1d66e98572ada34d47df3eab9
Author: Xin Ren <[email protected]>
Date:   2016-06-10T00:01:44Z

    [SPARK-15509] some clean up

commit c8e30e9452031908fc829e527ab82a8e93598302
Author: Xin Ren <[email protected]>
Date:   2016-06-10T00:45:53Z

    [SPARK-15509] fix path

commit 43b2f8c5fb9e0d74579b948b1d52cad4faa76b66
Author: Xin Ren <[email protected]>
Date:   2016-06-10T00:48:36Z

    [SPARK-15509] fix path

----

