GitHub user shahidki31 opened a pull request:
https://github.com/apache/spark/pull/21509
Check for invalid input type of weight data in ml.PowerIterationClustering
## What changes were proposed in this pull request?
Currently, ml.PowerIterationClustering does not check the data type of the
weight column, so the following test case fails with the error shown below it.
```
test("invalid input types for weight") {
  val invalidWeightData = spark.createDataFrame(Seq(
    (0L, 1L, "a"),
    (2L, 3L, "b")
  )).toDF("src", "dst", "weight")
  val pic = new PowerIterationClustering()
    .setWeightCol("weight")
  val result = pic.assignClusters(invalidWeightData)
}
```
```
Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost,
executor driver): scala.MatchError: [0,1,null] (of class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
```
This PR adds a type check for the weight column.
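For illustration only, a check of this kind could look roughly like the sketch below. This is not the code from this PR: the object `WeightColumnCheck` and the helper `requireNumericColumn` are hypothetical names, and the sketch assumes that "valid" means any Spark SQL numeric type. It uses only the public `StructType` and `NumericType` APIs and fails on the driver before any job is launched.

```scala
import org.apache.spark.sql.types.{NumericType, StructType}

// Hypothetical helper, not taken from this PR.
object WeightColumnCheck {
  // Throws IllegalArgumentException if `colName` is missing from the schema
  // or if its data type is not a numeric type.
  def requireNumericColumn(schema: StructType, colName: String): Unit = {
    // StructType.apply(name) itself throws IllegalArgumentException
    // when the column does not exist.
    val actualType = schema(colName).dataType
    require(
      actualType.isInstanceOf[NumericType],
      s"Column $colName must be of a numeric type but was actually of type " +
        s"${actualType.simpleString}.")
  }
}
```

Under these assumptions, calling `WeightColumnCheck.requireNumericColumn(dataset.schema, "weight")` at the start of `assignClusters` would reject the string-typed weight column from the test above with an `IllegalArgumentException` instead of a `scala.MatchError` raised inside a task.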
## How was this patch tested?
Added a unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/shahidki31/spark testCasePic
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21509.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21509
----
commit 0d6d7be494b6b331a09d91b15a29bac98eac4c74
Author: Shahid <shahidki31@...>
Date: 2018-06-07T20:58:24Z
Example code for Power Iteration Clustering
----