zhengruifeng commented on pull request #29850:
URL: https://github.com/apache/spark/pull/29850#issuecomment-698138577
test code:
```
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._  // needed for lit

// Build a cached DataFrame with 100 numeric columns and 1,000,000 rows
// (run in spark-shell, where sc and the toDF implicits are in scope).
var df = sc.range(0, 1000000, 1, 4).map(Tuple1.apply).toDF("real0")
(1 until 100).foreach { i => df = df.withColumn(s"real$i", lit(i.toDouble)) }
df.cache()
df.count()

val n = 1000000  // output dimension
val realCols = df.schema.fieldNames
val hasher = new FeatureHasher().setInputCols(realCols).setOutputCol("features").setNumFeatures(n)

// Time 1000 transform-and-count passes; result is elapsed milliseconds.
val start = System.currentTimeMillis
Seq.range(0, 1000).foreach { i => hasher.transform(df).count() }
val end = System.currentTimeMillis
end - start
```
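For context on what the benchmark exercises: the sketch below illustrates the hashing trick for a numeric column, where the column name is hashed and the raw value is placed at the resulting index. This is a hypothetical simplification for illustration; Spark's FeatureHasher does use MurmurHash3 internally, but the exact seed and index handling here are assumptions, not the actual implementation.
```
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: for a numeric column, hash the column *name*
// and emit the raw value at the resulting index in [0, numFeatures).
// Seed and modulo handling are illustrative, not Spark's exact code.
def hashedIndex(colName: String, numFeatures: Int): Int = {
  val h = MurmurHash3.stringHash(colName)
  ((h % numFeatures) + numFeatures) % numFeatures  // non-negative modulo
}

// With numFeatures = 1000000, every value in "real0" lands in one fixed slot:
val idx = hashedIndex("real0", 1000000)
```
With 100 numeric columns and 1,000,000 rows, each pass performs on the order of 10^8 such hash-and-insert operations, which is why small per-feature savings are visible in the wall-clock numbers below.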
Dataset: 100 numeric columns, 1,000,000 rows; output dimension: 1,000,000.
result:

| Branch  | start (epoch ms) | end (epoch ms) | elapsed (ms) |
|---------|------------------|----------------|--------------|
| this PR | 1600927462783    | 1600927521829  | 59046        |
| master  | 1600927779679    | 1600927845831  | 66152        |
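Per iteration, that is 59046 / 1000 ≈ 59 ms per transform-and-count pass on this PR versus 66152 / 1000 ≈ 66 ms on master.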
@MLnick I just ran a simple test, and it shows we obtain about an 11% speedup ((66152 - 59046) / 66152 ≈ 10.7%).