[GitHub] [spark] MrPowers commented on a change in pull request #31254: [SPARK-34165][SQL] Add count_distinct as an option to Dataset#summary

GitBox Mon, 25 Jan 2021 21:02:21 -0800


MrPowers commented on a change in pull request #31254:
URL: https://github.com/apache/spark/pull/31254#discussion_r564212822




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2670,6 +2670,7 @@ class Dataset[T] private[sql](
    *   <li>min</li>
    *   <li>max</li>
    *   <li>arbitrary approximate percentiles specified as a percentage (e.g. 75%)</li>
+   *   <li>count_distinct</li>

Review comment:
       @HyukjinKwon - you raise a good point about API ambiguity between 
`Dataset.describe` and `Dataset.summary`, but those ambiguities already exist.  
`df.describe("some_col")`, `df.select("some_col").summary()`, and 
`df.select("some_col").summary("37%", "87%")` all return different results.
   
   `describe` is a "partial alias" of `summary`, but the method signatures are 
different.
   
   The Spark & Pandas `describe` methods aren't consistent either.  The Spark 
`describe` method takes column name arguments whereas the Pandas `describe` 
method takes options, see the Pandas method signature: 
`DataFrame.describe(percentiles=None, include=None, exclude=None, 
datetime_is_numeric=False)`.
   
   A Pandas user is already going to be confused by the API.  They're going to 
run `df.describe(percentiles=[.25, .5, .75])` and have to Google around to 
find that they need to run `df.summary("25%", "50%", "75%")`.  
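
   To make the mismatch concrete, here is a minimal sketch of the translation a
Pandas user has to do by hand. Pandas takes percentiles as fractions between 0
and 1, while the `summary` doc comment in the diff above describes percentiles
"specified as a percentage (e.g. 75%)". The helper name
`to_summary_percentiles` is hypothetical and is not part of Spark or Pandas.

```python
from typing import Iterable, List


def to_summary_percentiles(percentiles: Iterable[float]) -> List[str]:
    """Hypothetical helper: turn pandas-style fractions into percentage strings."""
    labels = []
    for p in percentiles:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"percentile must be in [0, 1], got {p}")
        # round() guards against float noise such as 0.29 * 100 == 28.999999999999996
        pct = round(p * 100, 6)
        # The :g format drops a trailing ".0", so 0.25 becomes "25%"
        labels.append(f"{pct:g}%")
    return labels
```

   Assuming a PySpark DataFrame `df`, the result could then be splatted into
the call, e.g. `df.summary(*to_summary_percentiles([.25, .5, .75]))`.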
   
   Adding additional options to `Dataset.summary` seems consistent with the 
existing design.  The `summary` method is already where we put the extra 
metrics that aren't enabled by default, like the option to add arbitrary 
percentiles.
   
   BTW - I don't mean to complain here ;). It's easy for me to comment on API 
decisions made years ago by folks in the trenches.  You are all doing a 
wonderful job maintaining this project and continuously moving it in a good 
direction.  I'm always impressed by the code you write and will hopefully get 
to your level some day.  My PR comments sound critical because I'm just trying 
to get to the point.  I'm very grateful for all the work you've done on this 
project.




