[PR] [SPARK-49223][ML][SQL] Introduce an expression for multi-column grouped count [spark]

via GitHub Tue, 13 Aug 2024 06:20:57 -0700


zhengruifeng opened a new pull request, #47742:
URL: https://github.com/apache/spark/pull/47742


   ### What changes were proposed in this pull request?
   1, Introduce an expression `GroupedCount` for multi-column grouped count
   2, Use this expression in `StringIndexer`
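
To make "multi-column grouped count" concrete, here is a minimal sketch of the semantics using plain Scala collections. It is not the `GroupedCount` Catalyst expression from this PR. The object name `GroupedCountSketch`, the method `groupedCount`, and the row representation (`Seq[Seq[String]]`) are assumptions made for illustration only.

```scala
// Illustrative only: plain Scala collections, not the PR's Catalyst expression.
object GroupedCountSketch {
  // For every column index, count how often each distinct value occurs,
  // visiting each row exactly once.
  def groupedCount(rows: Seq[Seq[String]], numCols: Int): Array[Map[String, Long]] = {
    // One mutable counter map per column (hypothetical layout).
    val counts = Array.fill(numCols)(scala.collection.mutable.Map.empty[String, Long])
    rows.foreach { row =>
      var i = 0
      while (i < numCols) {
        val value = row(i)
        counts(i)(value) = counts(i).getOrElse(value, 0L) + 1L
        i += 1
      }
    }
    counts.map(_.toMap)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq("a", "x"), Seq("b", "x"), Seq("a", "y"))
    val result = groupedCount(rows, numCols = 2)
    result.zipWithIndex.foreach { case (m, i) => println(s"column $i: $m") }
  }
}
```

The point of the sketch is that all columns are counted in a single pass over the data, which is the shape of computation a dedicated expression can offer.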
   
   ### Why are the changes needed?
   While investigating the plotting functions, I found that computing grouped 
counts over multiple columns is a very common use case. There are mainly two 
approaches:
   1, `StringIndexer` in ML, using the `StringIndexerAggregator`;
   
https://github.com/apache/spark/blob/e7e082663b94d0cc4d8d3d224bd8d819c4b2f4d3/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L648-L690
   2, Histogram plot in PySpark, which creates a group of intermediate DataFrames, 
then unions them and computes the counts.
   
https://github.com/apache/spark/blob/70b814b558beabbdf48331b98755bafeaba4f17b/python/pyspark/pandas/plot/core.py#L200-L238
   
   
   A dedicated expression for this purpose would give us:
   1, better SerDes than approach 1;
   2, simpler plan than approach 2;
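
For contrast, the union-based pattern of approach 2 can be modeled roughly as follows. This is a sketch on plain Scala collections rather than the actual PySpark plotting code linked above. The names `UnionApproachSketch`, `perColumn`, and `groupedCountViaUnion` are hypothetical.

```scala
// Illustrative only: models "one intermediate result per column, then union".
object UnionApproachSketch {
  // One intermediate result per column: every value tagged with its column index.
  def perColumn(rows: Seq[Seq[String]], colIdx: Int): Seq[(Int, String)] =
    rows.map(row => (colIdx, row(colIdx)))

  // Union all per-column results, then count per (column index, value) key.
  def groupedCountViaUnion(rows: Seq[Seq[String]], numCols: Int): Map[(Int, String), Long] = {
    val unioned = (0 until numCols)
      .map(i => perColumn(rows, i))
      .foldLeft(Seq.empty[(Int, String)])(_ ++ _)
    unioned.groupBy(identity).map { case (key, occurrences) =>
      key -> occurrences.size.toLong
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq("a", "x"), Seq("b", "x"), Seq("a", "y"))
    groupedCountViaUnion(rows, numCols = 2).foreach(println)
  }
}
```

In this model the number of intermediate results grows with the number of columns, which is the analogue of the larger query plan mentioned in item 2.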
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   1, existing tests
   2, manually tested SerDes with the following snippet (run in `spark-shell`):
   
   ```
   import org.apache.spark.ml.feature.StringIndexer
   
   val numCol = 300
   val data = (0 to 10000).map { i => (i, 100 * i) }
   var df = data.toDF("id", "label0")
   (1 to numCol).foreach { idx =>
     df = df.withColumn(s"label$idx", col("label0") + 1)
   }
   val inputCols = (0 to numCol).map(i => s"label$i").toArray
   val outputCols = (0 to numCol).map(i => s"labelIndex$i").toArray
   val indexer = new StringIndexer()
     .setInputCols(inputCols)
     .setOutputCols(outputCols)
     .setStringOrderType("frequencyAsc")
     .fit(df)
   ```
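
As background on why `StringIndexer` needs per-column counts at all: with `frequencyAsc`, labels are indexed by ascending frequency. The sketch below shows how an ordering could be derived from one column's counts. It assumes ties are broken alphabetically, as the `StringIndexer` documentation describes, and it is not the code path used by this PR. The names `FrequencyAscSketch` and `labelsByFrequencyAsc` are hypothetical.

```scala
// Illustrative only: derives a label ordering from per-column counts.
object FrequencyAscSketch {
  // Least frequent label first; equal counts fall back to alphabetical order (assumption).
  def labelsByFrequencyAsc(counts: Map[String, Long]): Array[String] =
    counts.toSeq
      .sortBy { case (label, count) => (count, label) }
      .map { case (label, _) => label }
      .toArray

  def main(args: Array[String]): Unit = {
    val counts = Map("a" -> 3L, "b" -> 1L, "c" -> 1L)
    // The position in the array is the index assigned to the label.
    labelsByFrequencyAsc(counts).zipWithIndex.foreach { case (label, idx) =>
      println(s"$label -> $idx")
    }
  }
}
```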
   
   before
   
![image](https://github.com/user-attachments/assets/52bb0ce2-c254-4c1e-a546-b42fb8fb744f)
   
   after
   
![image](https://github.com/user-attachments/assets/48c6d5a5-4bbd-4ac8-97eb-5e8300a81133)
   
   We can see that the shuffle size is reduced from 22.0 MiB to 14.9 MiB.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
