GitHub user shahidki31 opened a pull request:
https://github.com/apache/spark/pull/21270
Power Iteration Clustering in SparkML throws an exception when the ID is
IntType
Running the following code makes PIC throw an exception.
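```
import org.apache.spark.ml.clustering.PowerIterationClustering

// Input graph: each row holds a vertex id (Int), its neighbors, and the edge similarities.
val data = spark.createDataFrame(Seq(
  (0, Array(1), Array(0.9)),
  (1, Array(2), Array(0.9)),
  (2, Array(3), Array(0.9)),
  (3, Array(4), Array(0.1)),
  (4, Array(5), Array(0.9))
)).toDF("id", "neighbors", "similarities")

// Selecting "prediction" from the transformed output is what triggers the failure.
val result = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(10)
  .setInitMode("random")
  .transform(data)
  .select("id", "prediction")
```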
**Result**
`org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
'Project [id#215, 'prediction]
+- AnalysisBarrier
      +- Project [id#215, neighbors#216, similarities#217]
         +- Join Inner, (id#215 = id#234)
            :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
            :  +- LocalRelation [_1#209, _2#210, _3#211]
            +- Project [cast(id#230L as int) AS id#234]
               +- LogicalRDD [id#230L, prediction#231], false
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
`
## What changes were proposed in this pull request?
1) PIC should return only the "id" and "prediction" columns. Currently it
returns the entire input data, including the neighbors array and the
similarities array.
2) MLlib PIC returns "id" as Long and "prediction" as Int, so in ML we don't
need to cast the ID back to the user's input ID type. We can return the
output of MLlib PIC directly.
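The second point can be illustrated with a minimal sketch that calls MLlib PIC directly and exposes its assignments unchanged. This is not the patch itself: the object name `PicOutputSketch` and the helper `assignments` are hypothetical, and the edge triples are assumed to be equivalent to the DataFrame in the report.

```scala
import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPIC}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, for illustration only; not code from this pull request.
object PicOutputSketch {

  // Runs MLlib PIC and returns only ("id", "prediction"), with no cast on the id.
  def assignments(spark: SparkSession): DataFrame = {
    // (srcId, dstId, similarity) triples, assumed equivalent to the reported input.
    val similarities = spark.sparkContext.parallelize(Seq(
      (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.9), (3L, 4L, 0.1), (4L, 5L, 0.9)))

    val model = new MLlibPIC()
      .setK(2)
      .setMaxIterations(10)
      .setInitializationMode("random")
      .run(similarities)

    // Each Assignment carries id: Long and cluster: Int; keep both types as they are.
    spark.createDataFrame(model.assignments.map(a => (a.id, a.cluster)))
      .toDF("id", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("pic-output-sketch")
      .getOrCreate()
    val result = assignments(spark)
    result.printSchema() // shows the Long id and Int prediction columns
    result.show()
    spark.stop()
  }
}
```

Because the id stays a Long here, a caller whose IDs are Int would have to cast on their side if they want to join the result back to the input.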
## How was this patch tested?
Added a unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/shahidki31/spark sparkSim
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21270.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21270
----
commit f7bb93a1e84821d9777229eb72f06f150c741729
Author: Shahid <shahidki31@...>
Date: 2018-05-08T17:08:50Z
Example code for Power Iteration Clustering
commit ff9e0795dbdcd6f3548ef8e6e73d805bb9b7584e
Author: Shahid <shahidki31@...>
Date: 2018-05-08T20:02:15Z
Example code for Power Iteration Clustering
----