zhengruifeng commented on pull request #29850:
URL: https://github.com/apache/spark/pull/29850#issuecomment-698138577


   test code:
```
// Intended to be run in spark-shell (relies on sc and the toDF implicits).
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._

// Build a DataFrame with 100 numerical columns and 1,000,000 rows.
var df = sc.range(0, 1000000, 1, 4).map(Tuple1.apply).toDF("real0")
(1 until 100).foreach { i => df = df.withColumn(s"real$i", lit(i.toDouble)) }

df.cache()
df.count()

// Hash all 100 columns into a 1,000,000-dimensional feature vector.
val n = 1000000
val realCols = df.schema.fieldNames
val hasher = new FeatureHasher().setInputCols(realCols).setOutputCol("features").setNumFeatures(n)

// Time 1000 transform-and-count iterations (milliseconds).
val start = System.currentTimeMillis
Seq.range(0, 1000).foreach { i => hasher.transform(df).count() }
val end = System.currentTimeMillis
end - start
```
   
Dataset: 100 numerical columns, 1,000,000 rows;
Output dim: 1,000,000;
   
Result:

This PR:
start: Long = 1600927462783
end: Long = 1600927521829
res3: Long = 59046

Master:
start: Long = 1600927779679
end: Long = 1600927845831
res3: Long = 66152
   
@MLnick I just did a simple test; it shows we can obtain roughly an 11% speedup.
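
For reference, the ~11% figure is simply the relative runtime reduction computed from the two totals above:

```
// Relative runtime reduction, from the res3 values above:
// (66152 - 59046) / 66152 ≈ 0.107, i.e. roughly an 11% shorter runtime.
val reduction = (66152.0 - 59046.0) / 66152.0
```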

