Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19527
Benchmark of the multi-column one hot encoder.
Multi-Col, Multiple Runs: the first commit; runs one `treeAggregate` per column.
Multi-Col, Single Run: runs a single `treeAggregate` over all columns, as
suggested at https://github.com/apache/spark/pull/19527#discussion_r145457081. A rough sketch of the two strategies is shown below.
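For clarity, a minimal sketch of the two aggregation strategies, assuming fitting only needs the maximum category index of each input column (the helper names `fitMultipleRuns` and `fitSingleRun` are made up for illustration; this is not the PR's actual code):
```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Hypothetical sketch, not the PR's code: reduce each column to its max category index.

// Multiple runs: one treeAggregate pass over the data per column.
def fitMultipleRuns(df: DataFrame, inputCols: Seq[String]): Seq[Int] =
  inputCols.map { c =>
    df.select(c).rdd.treeAggregate(0)(
      (max, row) => math.max(max, row.getInt(0)),  // seqOp within a partition
      (m1, m2) => math.max(m1, m2))                // combOp across partitions
  }

// Single run: one treeAggregate pass that updates the maxes of all columns at once.
def fitSingleRun(df: DataFrame, inputCols: Seq[String]): Array[Int] =
  df.select(inputCols.map(col): _*).rdd.treeAggregate(Array.fill(inputCols.length)(0))(
    (maxes, row) => {
      var i = 0
      while (i < maxes.length) {
        maxes(i) = math.max(maxes(i), row.getInt(i))
        i += 1
      }
      maxes
    },
    (m1, m2) => m1.zip(m2).map { case (a, b) => math.max(a, b) })
```
The single run scans the cached data once instead of once per column, which matches the fitting numbers below: the multiple-run time blows up with the number of columns while the single-run time stays small.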
Fitting (seconds, averaged over 10 runs):

numColumns | Multi-Col, Multiple Runs | Multi-Col, Single Run
-- | -- | --
1 | 0.110 | 0.130
100 | 3.688 | 0.364
1000 | 90.370 | 2.469
Transforming (seconds, averaged over 10 runs):

numColumns | Multi-Col, Multiple Runs | Multi-Col, Single Run
-- | -- | --
1 | 0.141 | 0.143
100 | 0.364 | 0.415
1000 | 3.193 | 2.803
Benchmark code:
```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 10000  // rows
val m = 1000   // columns

// Build a DataFrame of m integer columns holding random category indices in [0, 1000).
val rows = sc.parallelize(1 to n).map(i =>
  Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m).map(i =>
  StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()  // materialize the cache before timing

val inputCols = Array.range(0, m).map(i => s"c$i")
val outputCols = Array.range(0, m).map(i => s"c${i}_encoded")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(inputCols)
  .setOutputCols(outputCols)

var durationFitting = 0.0
var durationTransforming = 0.0
for (i <- 0 until 10) {
  val startFitting = System.nanoTime()
  val model = encoder.fit(df)
  val endFitting = System.nanoTime()
  durationFitting += (endFitting - startFitting) / 1e9

  val startTransforming = System.nanoTime()
  model.transform(df).count()  // force the full transformation
  val endTransforming = System.nanoTime()
  durationTransforming += (endTransforming - startTransforming) / 1e9
}
// Report the average over the 10 runs, in seconds.
println(s"fitting: ${durationFitting / 10}")
println(s"transforming: ${durationTransforming / 10}")
```