GitHub user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21081#discussion_r182924903
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
---
@@ -120,11 +123,32 @@ class KMeansModel private[ml] (
@Since("2.0.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)
+ @Since("2.4.0")
+ def featureToVector(dataset: Dataset[_], col: Column): Column = {
+ val featuresDataType = dataset.schema(getFeaturesCol).dataType
+ val transferUDF = featuresDataType match {
+ case _: VectorUDT => udf((vector: Vector) => vector)
--- End diff --
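For the VectorUDT case, just return ```col(getFeaturesCol)``` instead of wrapping the column in an identity UDF, since that will be more efficient.
(Calling a UDF incurs data serialization overhead, even when the UDF returns its input unchanged.)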
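
A minimal sketch of the difference, not the PR's actual code: the object name `FeatureColumnSketch`, the local `SparkSession`, and the two-row sample data are assumptions made for illustration. It selects the same vector column once through an identity UDF and once as a plain column reference, and checks that both return the same values; only the UDF path has to convert each vector out of and back into Spark's internal representation.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical standalone example; names and data are illustrative only.
object FeatureColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("feature-column-sketch")
      .getOrCreate()
    import spark.implicits._

    // A tiny DataFrame whose "features" column already has type VectorUDT.
    val df = Seq(
      (0, Vectors.dense(1.0, 2.0)),
      (1, Vectors.dense(3.0, 4.0))
    ).toDF("id", "features")

    // Identity UDF: each row's vector is deserialized, passed through the
    // Scala function, and serialized again.
    val identityUdf = udf((v: Vector) => v)
    val viaUdf: Column = identityUdf(col("features"))

    // Plain column reference: no UDF call, so no extra conversion step.
    val direct: Column = col("features")

    val fromUdf = df.select(viaUdf.as("f")).collect().map(_.getAs[Vector]("f"))
    val fromCol = df.select(direct.as("f")).collect().map(_.getAs[Vector]("f"))

    // Both paths yield the same vectors; only the cost differs.
    assert(fromUdf.sameElements(fromCol))

    spark.stop()
  }
}
```

Comparing `df.select(viaUdf).explain()` with `df.select(direct).explain()` should show the UDF call in the first physical plan and a bare column projection in the second.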
---