devin-petersohn commented on code in PR #54370:
URL: https://github.com/apache/spark/pull/54370#discussion_r2848506888
##########
python/pyspark/pandas/frame.py:
##########
@@ -9966,33 +9966,38 @@ def describe(self, percentiles: Optional[List[float]] =
None) -> "DataFrame":
has_numeric_type = len(psser_numeric) > 0
if is_all_string_type:
- # Handling string type columns
- # We will retrieve the `count`, `unique`, `top` and `freq`.
internal = self._internal.resolved_copy
exprs_string = [
internal.spark_column_for(psser._column_label) for psser in
psser_string
]
sdf = internal.spark_frame.select(*exprs_string)
- # Get `count` & `unique` for each columns
counts, uniques = map(lambda x: x[1:], sdf.summary("count",
"count_distinct").take(2))
- # Handling Empty DataFrame
if len(counts) == 0 or counts[0] == "0":
data = dict()
for psser in psser_string:
data[psser.name] = [0, 0, np.nan, np.nan]
return DataFrame(data, index=["count", "unique", "top",
"freq"])
- # Get `top` & `freq` for each columns
- tops = []
- freqs = []
- # TODO(SPARK-37711): We should do it in single pass since invoking
Spark job
- # for every columns is too expensive.
- for column in exprs_string:
- top, freq = sdf.groupby(column).count().sort("count",
ascending=False).first()
- tops.append(str(top))
- freqs.append(str(freq))
-
+ n_cols = len(column_names)
+ stack_args = ", ".join([f"'{col_name}', `{col_name}`" for col_name
in column_names])
Review Comment:
Good catch. They are equivalent, but I agree it was confusing. Switched to
using posexplode on exprs_string directly, so the unpivot no longer depends
on column_names — consistent with the other implementations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]