zhengruifeng commented on pull request #29850:
URL: https://github.com/apache/spark/pull/29850#issuecomment-698138577
test code:
```
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._  // needed for lit

// Build a cached DataFrame with 100 numeric columns and 1,000,000 rows
// (run in spark-shell, where sc and the toDF implicits are in scope).
var df = sc.range(0, 1000000, 1, 4).map(Tuple1.apply).toDF("real0")
(1 until 100).foreach { i => df = df.withColumn(s"real$i", lit(i.toDouble)) }
df.cache()
df.count()

val n = 1000000  // output dimension
val realCols = df.schema.fieldNames
val hasher = new FeatureHasher().setInputCols(realCols).setOutputCol("features").setNumFeatures(n)

// Time 1000 transform-and-count passes; result is elapsed milliseconds.
val start = System.currentTimeMillis
Seq.range(0, 1000).foreach { i => hasher.transform(df).count() }
val end = System.currentTimeMillis
end - start
```
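For context on what the benchmark exercises: the sketch below illustrates the hashing trick for a numeric column, where the column name is hashed and the raw value is placed at the resulting index. This is a hypothetical simplification for illustration; Spark's FeatureHasher does use MurmurHash3 internally, but the exact seed and index handling here are assumptions, not the actual implementation.
```
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: for a numeric column, hash the column *name*
// and emit the raw value at the resulting index in [0, numFeatures).
// Seed and modulo handling are illustrative, not Spark's exact code.
def hashedIndex(colName: String, numFeatures: Int): Int = {
  val h = MurmurHash3.stringHash(colName)
  ((h % numFeatures) + numFeatures) % numFeatures  // non-negative modulo
}

// With numFeatures = 1000000, every value in "real0" lands in one fixed slot:
val idx = hashedIndex("real0", 1000000)
```
With 100 numeric columns and 1,000,000 rows, each pass performs on the order of 10^8 such hash-and-insert operations, which is why small per-feature savings are visible in the wall-clock numbers below.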
Dataset: 100 numeric columns, 1,000,000 rows; output dimension: 1,000,000.
result:

| Branch  | start (epoch ms) | end (epoch ms) | elapsed (ms) |
|---------|------------------|----------------|--------------|
| this PR | 1600927462783    | 1600927521829  | 59046        |
| master  | 1600927779679    | 1600927845831  | 66152        |
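Per iteration, that is 59046 / 1000 ≈ 59 ms per transform-and-count pass on this PR versus 66152 / 1000 ≈ 66 ms on master.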
@MLnick I just ran a simple test, and it shows we obtain about an 11% speedup ((66152 - 59046) / 66152 ≈ 10.7%).