Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19527
  
    Benchmark against the existing one-hot encoder.
    
    Because the existing encoder only needs to run `transform`, there is no fitting time to measure.
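
    The difference in API shape, as a minimal sketch (`OneHotEncoderEstimator` is the multi-column estimator added in this PR; `df` is the benchmark DataFrame below):

    ```scala
    // Existing API: a pure Transformer, so there is nothing to fit.
    val oldEncoded = new OneHotEncoder()
      .setInputCol("c0").setOutputCol("c0_encoded")
      .transform(df)

    // New API in this PR: an Estimator, so a fit step exists (and can be timed).
    val model = new OneHotEncoderEstimator()
      .setInputCols(Array("c0")).setOutputCols(Array("c0_encoded"))
      .fit(df)
    val newEncoded = model.transform(df)
    ```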
    
    
    Transforming (seconds; average of 10 runs, except where noted):
    
    numColumns | Existing one hot encoder
    -- | --
    1 | 0.2516055188
    100 | 20.291758921100005
    1000 | 26242.039411932*
    
    * Because ten iterations would take too long to finish, I ran only one iteration for the 1000-column case, but it already shows how the cost scales.
    
    Benchmark code:
    
    ```scala
    import org.apache.spark.ml.feature._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import scala.util.Random

    val seed = 123L
    val random = new Random(seed)
    val n = 10000  // number of rows
    val m = 1000   // number of columns

    // Random integer data: n rows x m columns, values in [0, 1000).
    val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
    val struct = new StructType(Array.range(0, m).map(i => StructField(s"c$i", IntegerType, true)))
    val df = spark.createDataFrame(rows, struct)
    df.persist()
    df.count()  // materialize the cache before timing

    val inputCols = Array.range(0, m).map(i => s"c$i")
    val outputCols = Array.range(0, m).map(i => s"c${i}_encoded")

    // One single-column encoder per input column.
    val encoders = inputCols.zip(outputCols).map { case (in, out) =>
      new OneHotEncoder().setInputCol(in).setOutputCol(out)
    }

    var duration = 0.0
    for (i <- 0 until 10) {
      var encoded = df
      val start = System.nanoTime()
      encoders.foreach { encoder =>
        encoded = encoder.transform(encoded)
      }
      encoded.count  // force execution of all chained transforms
      val end = System.nanoTime()
      duration += (end - start) / 1e9
    }
    println(s"duration: ${duration / 10} s")
    ```
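
    For comparison, a minimal sketch of the corresponding benchmark for the multi-column `OneHotEncoderEstimator` added in this PR, reusing `df`, `inputCols` and `outputCols` from above (fitting and transforming timed separately):

    ```scala
    val estimator = new OneHotEncoderEstimator()
      .setInputCols(inputCols)
      .setOutputCols(outputCols)

    val fitStart = System.nanoTime()
    val model = estimator.fit(df)  // a single pass over the data covers all columns
    val fitEnd = System.nanoTime()

    val start = System.nanoTime()
    model.transform(df).count  // force execution
    val end = System.nanoTime()

    println(s"fitting: ${(fitEnd - fitStart) / 1e9} s, transforming: ${(end - start) / 1e9} s")
    ```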


