zhengruifeng opened a new pull request, #47742: URL: https://github.com/apache/spark/pull/47742
### What changes were proposed in this pull request?

1. Introduce an expression `GroupedCount` for multi-column grouped count.
2. Use this expression in `StringIndexer`.

### Why are the changes needed?

When investigating the plotting function, I found that computing grouped counts on multiple columns is a very common use case, and there are mainly two approaches:

1. `StringIndexer` in ML, which uses the `StringIndexerAggregator`:
   https://github.com/apache/spark/blob/e7e082663b94d0cc4d8d3d224bd8d819c4b2f4d3/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L648-L690
2. Histogram plot in PySpark, which creates a group of intermediate DataFrames, then unions them and computes the counts:
   https://github.com/apache/spark/blob/70b814b558beabbdf48331b98755bafeaba4f17b/python/pyspark/pandas/plot/core.py#L200-L238

We can introduce a dedicated expression for this purpose, which gives us:

1. better SerDes than approach 1;
2. a simpler plan than approach 2.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

1. Existing tests.
2. Manual test of SerDes:

```
import org.apache.spark.ml.feature.StringIndexer

val numCol = 300
val data = (0 to 10000).map { i => (i, 100 * i) }

var df = data.toDF("id", "label0")
(1 to numCol).foreach { idx =>
  df = df.withColumn(s"label$idx", col("label0") + 1)
}

val inputCols = (0 to numCol).map(i => s"label$i").toArray
val outputCols = (0 to numCol).map(i => s"labelIndex$i").toArray

val indexer = new StringIndexer()
  .setInputCols(inputCols)
  .setOutputCols(outputCols)
  .setStringOrderType("frequencyAsc")
  .fit(df)
```

Before: (screenshot)

After: (screenshot)

We can see that the shuffle size is reduced from 22.0 MiB to 14.9 MiB.

### Was this patch authored or co-authored using generative AI tooling?

No
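For readers unfamiliar with the term, the semantics of a multi-column grouped count can be sketched with plain Scala collections. This is only an illustration of the result shape, not the `GroupedCount` expression from this PR; the helper name `groupedCount`, its signature, and the sample rows are hypothetical.

```scala
// Hypothetical helper (NOT the PR's GroupedCount expression):
// for every column index, count how often each distinct value occurs.
def groupedCount(rows: Seq[Seq[String]]): Map[Int, Map[String, Long]] = {
  rows
    // Tag every cell with its column index: (colIdx, value).
    .flatMap(row => row.zipWithIndex.map { case (value, colIdx) => (colIdx, value) })
    // Count occurrences of each (colIdx, value) pair in a single pass over all columns.
    .groupBy(identity)
    .map { case (key, occurrences) => key -> occurrences.size.toLong }
    // Regroup the flat counts into one value -> count map per column.
    .groupBy { case ((colIdx, _), _) => colIdx }
    .map { case (colIdx, counts) =>
      colIdx -> counts.map { case ((_, value), n) => value -> n }
    }
}

// Made-up sample data: two columns, three rows.
val rows = Seq(
  Seq("a", "x"),
  Seq("b", "x"),
  Seq("a", "y")
)
val counts = groupedCount(rows)
```

Here `counts(0)` maps `"a"` to 2 and `"b"` to 1, and `counts(1)` maps `"x"` to 2 and `"y"` to 1. All columns are counted together in one grouping step, which is the property that distinguishes this from building one intermediate result per column and unioning them.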
