kachayev opened a new pull request #28133: [SPARK-31156][SQL] 
DataFrameStatFunctions API to be consistent with respect to Column type
URL: https://github.com/apache/spark/pull/28133
 
 
   ### What changes were proposed in this pull request?
   
   This PR introduces overloaded methods for
   * `stat.approxQuantile`
   * `stat.corr`
   * `stat.cov`
   * `stat.crosstab`
   
   to work with arguments passed as `Column`s rather than column names.
   
   ### Why are the changes needed?
   
   Some other functions in the `StatFunctions` module already provide 
`Column`-based versions, namely:
   * `stat.bloomFilter`
   * `stat.countMinSketch`
   * `stat.sampleBy`
   
   The proposed change allows more flexible usage patterns and makes the API 
more consistent.
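
   For context, here is a minimal sketch of the inconsistency as it looks 
before this change. It uses only calls I believe already exist in 
`DataFrameStatFunctions`; the sample data, the column names `x`, `y`, `label` 
and the object name `StatApiSketch` are made up for illustration, and the 
new `Column`-based overloads are deliberately not shown, since their exact 
signatures are defined by the diff rather than by this description.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Illustrative only: data, column names and object name are assumptions.
object StatApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("stat-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2.0, "a"), (2, 4.0, "b"), (3, 6.5, "a"))
      .toDF("x", "y", "label")

    // Name-based calls: the only form these four functions accept today.
    val corr = df.stat.corr("x", "y")
    val cov = df.stat.cov("x", "y")
    val median = df.stat.approxQuantile("y", Array(0.5), 0.0)
    val table = df.stat.crosstab("x", "label")

    // Column-based calls that already exist for other stat functions.
    val sampled = df.stat.sampleBy(col("label"), Map("a" -> 1.0, "b" -> 0.0), 42L)
    val filter = df.stat.bloomFilter(col("x"), 1000L, 0.03)

    println(s"corr=$corr cov=$cov median=${median.mkString(",")}")
    table.show()
    sampled.show()
    println(s"bloom filter might contain 1: ${filter.mightContain(1)}")

    spark.stop()
  }
}
```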
   
   ### Does this PR introduce any user-facing change?
   
   Yes, new signatures for stat API functions.
   
   ### How was this patch tested?
   
   Corresponding test cases are included.
   
   ### Thoughts
   
   * The Python and R APIs should probably also be revisited (I can do this in 
a separate PR or include it here)
   * `stat.freqItems` cannot be overloaded to provide the same functionality 
(because of type erasure)
   * I decided to keep the old versions of the helper functions in 
`StatFunctions`, e.g. `pearsonCorrelation`, and add new ones, e.g. 
`pearsonCorrelationByColumn` (the API looks internal, but it is public, and 
removing a public method might introduce issues)
   * The `resolveColumn` helper introduced for `StatFunctions` should probably 
be part of the `Dataset` API (e.g. `Dataset.drop` uses much the same approach 
when dealing with `Column` arguments). Should I move it in this PR or create 
another one? I don't want to put too many changes in a single bucket.
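
   To illustrate the type erasure point about `stat.freqItems`, here is a 
minimal standalone sketch. `Column` below is a hypothetical stand-in for 
`org.apache.spark.sql.Column` so that the snippet needs no Spark dependency, 
and `ErasureSketch` and `freqItemsByColumn` are names invented for this 
example, not part of the patch.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Column (assumption for this sketch).
final case class Column(name: String)

object ErasureSketch {
  // Name-based variant taking a Seq of column names.
  def freqItems(cols: Seq[String]): String =
    "by name: " + cols.mkString(", ")

  // A Column-based twin with the same method name is rejected by the compiler:
  // Seq[String] and Seq[Column] both erase to Seq, so the two methods would
  // have the same type after erasure ("double definition" error).
  // def freqItems(cols: Seq[Column]): String = ???

  // One possible way out is a distinct method name (hypothetical).
  def freqItemsByColumn(cols: Seq[Column]): String =
    "by column: " + cols.map(_.name).mkString(", ")

  def main(args: Array[String]): Unit = {
    println(freqItems(Seq("a", "b")))
    println(freqItemsByColumn(Seq(Column("a"), Column("b"))))
  }
}
```

   Uncommenting the second `freqItems` definition should make compilation 
fail, which is the limitation mentioned above.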
