GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/21270

    Power Iteration Clustering in SparkML throws an exception when the ID is IntType

    Running the following code makes PIC throw an exception.
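    ```
    import org.apache.spark.ml.clustering.PowerIterationClustering

    // The "id" column is inferred as IntegerType because the ids are Int literals.
    val data = spark.createDataFrame(Seq(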
          (0, Array(1), Array(0.9)),
          (1, Array(2), Array(0.9)),
          (2, Array(3), Array(0.9)),
          (3, Array(4), Array(0.1)),
          (4, Array(5), Array(0.9))
        )).toDF("id", "neighbors", "similarities")
    
    val result = new PowerIterationClustering()
          .setK(2)
          .setMaxIter(10)
          .setInitMode("random")
          .transform(data)
          .select("id", "prediction")
    ```
    
    **Result**
    `org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
    'Project [id#215, 'prediction]
    +- AnalysisBarrier
          +- Project [id#215, neighbors#216, similarities#217]
             +- Join Inner, (id#215 = id#234)
                :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
                :  +- LocalRelation [_1#209, _2#210, _3#211]
                +- Project [cast(id#230L as int) AS id#234]
                   +- LogicalRDD [id#230L, prediction#231], false
    
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    
    `
    
    
    ## What changes were proposed in this pull request?
    
    1) PIC needs to return only "id" and "prediction". Currently it returns the entire input data, including the neighbors array and the similarities array.
    2) MLlib PIC returns "id" as Long and "prediction" as Int, so in ML we don't need to cast the id back to the type of the user's input ID column. We can return the output of MLlib PIC directly.
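    
    For illustration only, here is a rough sketch of what point 2 amounts to when calling MLlib PIC directly. It is not the code in this patch: the object name `PicAssignmentsSketch`, the helper `clusterAssignments`, and the local `SparkSession` setup are made up for the example, and the edge triples are my hand translation of the DataFrame above.
    
    ```scala
    import org.apache.spark.mllib.clustering.PowerIterationClustering
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    // Hypothetical helper object, not part of the patch.
    object PicAssignmentsSketch {
    
      // Runs MLlib PIC and exposes its assignments as an (id, prediction) DataFrame.
      def clusterAssignments(spark: SparkSession): DataFrame = {
        import spark.implicits._
    
        // (srcId, dstId, similarity) triples; MLlib PIC expects Long vertex ids.
        val similarities = spark.sparkContext.parallelize(Seq(
          (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.9), (3L, 4L, 0.1), (4L, 5L, 0.9)
        ))
    
        val model = new PowerIterationClustering()
          .setK(2)
          .setMaxIterations(10)
          .setInitializationMode("random")
          .run(similarities)
    
        // Assignment(id: Long, cluster: Int) is used as-is, with no cast back to Int.
        model.assignments
          .map(a => (a.id, a.cluster))
          .toDF("id", "prediction")
      }
    
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("pic-assignments-sketch")
          .getOrCreate()
        try {
          val result = clusterAssignments(spark)
          result.printSchema()
          result.show()
        } finally {
          spark.stop()
        }
      }
    }
    ```
    
    With this shape, selecting "id" and "prediction" from the result does not depend on the type of the caller's original id column.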
    
    ## How was this patch tested?
    Added a unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark sparkSim

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21270
    
----
commit f7bb93a1e84821d9777229eb72f06f150c741729
Author: Shahid <shahidki31@...>
Date:   2018-05-08T17:08:50Z

    Example code for Power Iteration Clustering

commit ff9e0795dbdcd6f3548ef8e6e73d805bb9b7584e
Author: Shahid <shahidki31@...>
Date:   2018-05-08T20:02:15Z

    Example code for Power Iteration Clustering

----

