viirya opened a new pull request #25442: [SPARK-28722][ML] Change sequential label sorting in StringIndexer fit to parallel
URL: https://github.com/apache/spark/pull/25442
 
 
   ## What changes were proposed in this pull request?
   
   The `fit` method in `StringIndexer` sorts the labels of each input column sequentially when there are multiple input columns. As the number of input columns increases, the label-sorting time increases dramatically, which makes `fit` hard to use in practice with hundreds of input columns.
   
   This patch tries to make the label sorting parallel.
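   
   As a rough illustration of the idea (not the code in this patch), the sketch below sorts each column's labels in its own `Future` instead of one column after another. It uses only the Scala standard library. The object name `ParallelLabelSort`, its method names, and the in-memory label sequences are hypothetical; in Spark the per-column work would involve Spark jobs rather than plain collections.
   
   ```scala
   import scala.concurrent.{Await, Future}
   import scala.concurrent.ExecutionContext.Implicits.global
   import scala.concurrent.duration._
   
   // Hypothetical helper, for illustration only.
   object ParallelLabelSort {
     // Sort one column's labels in descending alphabetical order
     // (the ordering that "alphabetDesc" asks for).
     def sortDesc(labels: Seq[String]): Seq[String] =
       labels.sorted(Ordering[String].reverse)
   
     // Sequential baseline: columns are sorted one after another.
     def sortAllSequential(columns: Seq[Seq[String]]): Seq[Seq[String]] =
       columns.map(sortDesc)
   
     // Parallel variant: start one Future per column, then wait for all of them.
     // The result keeps the original column order.
     def sortAllParallel(columns: Seq[Seq[String]]): Seq[Seq[String]] = {
       val futures = columns.map(labels => Future(sortDesc(labels)))
       Await.result(Future.sequence(futures), 1.minute)
     }
   }
   ```
   
   For example, `ParallelLabelSort.sortAllParallel(Seq(Seq("b", "a", "c")))` returns `Seq(Seq("c", "b", "a"))`, the same result as the sequential version.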
   
   The following benchmark was run:
   ```scala
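   // Intended for spark-shell, where spark.implicits._ (needed for toDF) is already imported.
   import org.apache.spark.ml.feature.StringIndexer
   import org.apache.spark.sql.functions.col
   
   val numCol = 300
   
   // 101 rows with an id and a base label column.
   val data = (0 to 100).map { i =>
     (i, 100 * i)
   }
   var df = data.toDF("id", "label0")
   // Derive numCol extra label columns from label0.
   (1 to numCol).foreach { idx =>
     df = df.withColumn(s"label$idx", col("label0") + 1)
   }
   // label0 plus the derived columns: numCol + 1 input columns in total.
   val inputCols = (0 to numCol).map(i => s"label$i").toArray
   val outputCols = (0 to numCol).map(i => s"labelIndex$i").toArray
   // Time only the fit call, which is where the label sorting happens.
   val t0 = System.nanoTime()
   val indexer = new StringIndexer()
     .setInputCols(inputCols)
     .setOutputCols(outputCols)
     .setStringOrderType("alphabetDesc")
     .fit(df)
   val t1 = System.nanoTime()
   println("Elapsed time: " + (t1 - t0) / 1000000000.0 + "s")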
   ```
   
   | numCol  | 20 | 50  | 100  | 200  | 300 |
   |--:|---|---|---|---|---|
   |  Before |  9.85 |  28.62 | 64.35  | 167.17  | 431.60 |
   | After  | 2.44  | 2.71  | 3.34  | 4.83  | 6.90 |
   
   Unit: seconds
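   
   To put the table in terms of speedup, the timings can be divided column by column. The snippet below is only a convenience calculation on the numbers reported above; the object name `FitSpeedup` is hypothetical.
   
   ```scala
   // Hypothetical helper: speedup = time before / time after, per numCol setting.
   object FitSpeedup {
     val numCols = Seq(20, 50, 100, 200, 300)
     val before  = Seq(9.85, 28.62, 64.35, 167.17, 431.60) // seconds, from the table
     val after   = Seq(2.44, 2.71, 3.34, 4.83, 6.90)       // seconds, from the table
   
     // One ratio per column of the table, in the same order as numCols.
     val speedups: Seq[Double] = before.zip(after).map { case (b, a) => b / a }
   
     // Print one line per numCol setting, rounded to one decimal place.
     def report(): Unit =
       numCols.zip(speedups).foreach { case (n, s) =>
         println(f"numCol=$n%3d speedup=$s%.1fx")
       }
   }
   ```
   
   Calling `FitSpeedup.report()` prints the ratios, which grow from roughly 4x at 20 columns to roughly 63x at 300 columns.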
   
   ## How was this patch tested?
   
   Passed existing tests. Performance was tested manually with the benchmark above.
   
