Github user zhengruifeng commented on the issue:
    I tested the performance on a small dataset; the values in the following table are the average durations in seconds:
    | numColumns | Old Mean | Old Median | New Mean | New Median |
    |---|---|---|---|---|
    We can see that, even on small data, the speedup is significant.
    On big datasets that do not fit in memory, the speedup should be even better.
    The test code is here:
    import org.apache.spark.ml.feature.Imputer
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import spark.implicits._
    import scala.util.Random

    val seed = 123L
    val random = new Random(seed)
    val n = 10000
    val m = 100
    // Generate n rows of m random doubles (column names c0..c{m-1} assumed here,
    // since the original schema line was truncated).
    val rows = sc.parallelize(1 to n).map(i =>
      Row(Array.fill(m)(random.nextDouble): _*))
    val struct = new StructType(Array.range(0, m).map(i =>
      StructField(s"c$i", DoubleType, nullable = true)))
    val df = spark.createDataFrame(rows, struct)

    for (strategy <- Seq("mean", "median"); k <- Seq(1, 10, 100)) {
      // Impute the first k columns; the timed operation is assumed to be fit().
      val cols = Array.range(0, k).map(i => s"c$i")
      val imputer = new Imputer()
        .setStrategy(strategy)
        .setInputCols(cols)
        .setOutputCols(cols.map(c => s"${c}_out"))
      var duration = 0.0
      for (i <- 0 until 10) {
        val start = System.nanoTime()
        imputer.fit(df)
        val end = System.nanoTime()
        duration += (end - start) / 1e9
      }
      // Average fit time in seconds over 10 runs.
      println((strategy, k, duration / 10))
    }
