Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19527
Benchmark against the existing one-hot encoder. Because the existing encoder only needs to run `transform`, there is no fitting time.
Transforming:
numColumns | Existing one hot encoder (seconds)
-- | --
1 | 0.2516055188
100 | 20.291758921100005
1000 | 26242.039411932*
* Because ten iterations take too long to finish, I ran only one iteration for 1000 columns, but it already shows how the cost scales with the number of columns.
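For context, the new encoder in this PR is an estimator, so its benchmark would also include fitting time. A minimal sketch of that fit-then-transform pattern, assuming the multi-column `OneHotEncoderEstimator` API this PR proposes, with `df`, `inputCols`, and `outputCols` as defined in the benchmark code below:
```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// One estimator covers all m columns: fit() learns the category sizes,
// then a single transform() encodes every column in one pass.
val estimator = new OneHotEncoderEstimator()
  .setInputCols(inputCols)
  .setOutputCols(outputCols)
val model = estimator.fit(df)      // fitting time (absent for the old encoder)
val encoded = model.transform(df)  // transforming time
```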
Benchmark code:
```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 10000
val m = 1000

// Build an n-row DataFrame with m integer columns c0..c(m-1) of random
// values in [0, 1000). Note that the Random instance is serialized into
// each task, so partitions may produce identical sequences; that is fine
// for a timing benchmark.
val rows = sc.parallelize(1 to n).map { _ =>
  Row(Array.fill(m)(random.nextInt(1000)): _*)
}
val struct = new StructType(Array.range(0, m).map(i =>
  StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)

// Materialize and cache the input so data generation is excluded from timing.
df.persist()
df.count()

// Column name lists (also used by the estimator sketch above).
val inputCols = Array.range(0, m).map(i => s"c$i")
val outputCols = Array.range(0, m).map(i => s"c${i}_encoded")

// One single-column encoder per input column.
val encoders = Array.range(0, m).map(i =>
  new OneHotEncoder().setInputCol(s"c$i").setOutputCol(s"c${i}_encoded"))

// Time ten runs of chaining all m transforms and report the average.
var duration = 0.0
for (i <- 0 until 10) {
  var encoded = df
  val start = System.nanoTime()
  encoders.foreach { encoder =>
    encoded = encoder.transform(encoded)
  }
  encoded.count()
  val end = System.nanoTime()
  duration += (end - start) / 1e9
}
println(s"duration: ${duration / 10}")
```
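For a like-for-like comparison, a counterpart timing loop for the new encoder could time fitting and transforming separately. This is only a sketch, assuming the `OneHotEncoderEstimator` API shown earlier:
```scala
// Counterpart timing loop for the new estimator (`estimator` as defined in
// the sketch above). Fit and transform are timed separately, since the
// estimator adds a fitting pass the existing encoder does not have.
var fitTime = 0.0
var transformTime = 0.0
for (_ <- 0 until 10) {
  val start = System.nanoTime()
  val model = estimator.fit(df)
  val mid = System.nanoTime()
  model.transform(df).count()
  val end = System.nanoTime()
  fitTime += (mid - start) / 1e9
  transformTime += (end - mid) / 1e9
}
println(s"fit: ${fitTime / 10}, transform: ${transformTime / 10}")
```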