zhengruifeng commented on pull request #29850: URL: https://github.com/apache/spark/pull/29850#issuecomment-698138577
test code:
```scala
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.lit

// 1,000,000 rows in 4 partitions; "real0" holds the row id,
// "real1".."real99" hold constant doubles.
var df = sc.range(0, 1000000, 1, 4).map(Tuple1.apply).toDF("real0")
(1 until 100).foreach { i => df = df.withColumn(s"real$i", lit(i.toDouble)) }
df.cache()
df.count()

val n = 1000000
val realCols = df.schema.fieldNames
val hasher = new FeatureHasher()
  .setInputCols(realCols)
  .setOutputCol("features")
  .setNumFeatures(n)

// Time 1000 transform-and-count passes; count() forces the computation.
val start = System.currentTimeMillis
Seq.range(0, 1000).foreach { i => hasher.transform(df).count() }
val end = System.currentTimeMillis
end - start
```

Dataset: 100 numeric columns, 1,000,000 rows; output dim: 1,000,000.

Results (elapsed ms for the 1000 runs):

this PR:
```
start: Long = 1600927462783
end: Long = 1600927521829
res3: Long = 59046
```

Master:
```
start: Long = 1600927779679
end: Long = 1600927845831
res3: Long = 66152
```

@MLnick I just did a simple test, and it shows we obtain about an 11% speedup (66152 ms on master vs 59046 ms with this PR).
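For context on what this benchmark exercises: FeatureHasher maps each numeric input column to a fixed vector index by hashing the column name and reducing it modulo the output dimension, with the cell value stored at that index. Below is a minimal sketch of that indexing scheme, assuming a MurmurHash3-style string hash; Spark's actual hash function and seed differ, so the concrete indices will not match FeatureHasher's output.

```scala
import scala.util.hashing.MurmurHash3

// Sketch only: the hashed-feature indexing idea, not Spark's exact hash.
val numFeatures = 1000000  // same output dim as the benchmark above

def indexOf(colName: String): Int = {
  val h = MurmurHash3.stringHash(colName)
  ((h % numFeatures) + numFeatures) % numFeatures  // non-negative modulo
}

// A numeric column hashes to one fixed slot; its cell value becomes the
// feature value at that slot.
(0 until 3).foreach { i =>
  val col = s"real$i"
  println(s"$col -> index ${indexOf(col)}")
}
```

Since the indices for numeric columns depend only on the column names, they are the same for every one of the 1,000,000 rows in this benchmark.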