[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

jkbradley Thu, 19 Apr 2018 18:22:46 -0700

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182924903
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -120,11 +123,32 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  @Since("2.4.0")
    +  def featureToVector(dataset: Dataset[_], col: Column): Column = {
    +    val featuresDataType = dataset.schema(getFeaturesCol).dataType
    +    val transferUDF = featuresDataType match {
    +      case _: VectorUDT => udf((vector: Vector) => vector)
    --- End diff --
    
    For the ```VectorUDT``` case, just return ```col(getFeaturesCol)``` directly, since that will be more efficient than wrapping the column in an identity UDF. (Calling a UDF incurs data serialization overhead.)
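
    A rough sketch of what that could look like, written as a standalone helper rather than the PR's actual method. The names ```FeatureColumns``` and ```toVectorColumn``` are hypothetical, the sketch takes the column name as a parameter instead of reading ```getFeaturesCol```, and it compares against the public ```SQLDataTypes.VectorType``` because it is assumed to live outside Spark's own packages. Only the ```Array[Double]``` input case is shown:

    ```scala
    import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
    import org.apache.spark.sql.{Column, Dataset}
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{ArrayType, DoubleType}

    // Hypothetical helper for illustration only; not the code in this PR.
    object FeatureColumns {
      def toVectorColumn(dataset: Dataset[_], colName: String): Column = {
        dataset.schema(colName).dataType match {
          // Already a Vector column: reference it directly, no UDF involved.
          case dt if dt == SQLDataTypes.VectorType => col(colName)
          // Array[Double] column: a UDF is still needed to build the Vector.
          case ArrayType(DoubleType, _) =>
            val toVector = udf((values: Seq[Double]) => Vectors.dense(values.toArray))
            toVector(col(colName))
          case other =>
            throw new IllegalArgumentException(
              s"Unsupported type for column $colName: $other")
        }
      }
    }
    ```

    With this shape, the UDF (and its serialization cost) is only paid for inputs that actually need converting.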


