kachayev opened a new pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
URL: https://github.com/apache/spark/pull/28133

### What changes were proposed in this pull request?

This PR introduces overloaded methods for

* `stat.approxQuantile`
* `stat.corr`
* `stat.cov`
* `stat.crosstab`

that accept arguments as `Column`s rather than column names.

### Why are the changes needed?

Some other functions from the `StatFunctions` module already provide `Column`-based versions, namely:

* `stat.bloomFilter`
* `stat.countMinSketch`
* `stat.sampleBy`

The proposed change allows more flexible usage patterns and makes the API more consistent.

### Does this PR introduce any user-facing change?

Yes, new signatures for stat API functions.

### How was this patch tested?

Corresponding test cases are included.

### Thoughts

* The Python and R APIs should probably also be revisited (I can do this in a separate PR or include it here).
* `stat.freqItems` could not be overloaded to provide the same functionality (because of type erasure).
* I decided to keep the old versions of the helper functions in `StatFunctions`, e.g. `pearsonCorrelation`, and add new ones such as `pearsonCorrelationByColumn` (the API looks internal, but it is public, and removing a public method might introduce issues).
* The `resolveColumn` helper introduced for `StatFunctions` should probably be part of the `Dataset` API (e.g. `Dataset.drop` uses much the same approach when dealing with `Column` arguments). Should I move it in this PR or create another one? I don't want to put too many changes in a single bucket.
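
The type-erasure point about `stat.freqItems` can be illustrated without Spark. The following is a minimal sketch, assuming the clash is between a `Seq[String]` parameter and a would-be `Seq[Column]` parameter. `Column` here is a hypothetical stand-in case class rather than `org.apache.spark.sql.Column`, `FreqItemsSketch` is a made-up name, and the methods return a `String` instead of a `DataFrame` to keep the example self-contained. The `DummyImplicit` workaround shown is a general Scala idiom, not something this PR proposes.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Column (assumption, keeps the sketch Spark-free).
final case class Column(name: String)

object FreqItemsSketch {
  // Existing style: columns are passed by name.
  def freqItems(cols: Seq[String]): String =
    "by name: " + cols.mkString(", ")

  // A plain overload `def freqItems(cols: Seq[Column]): String` would be rejected:
  // after erasure both methods take a bare Seq, so the compiler reports a
  // "double definition" error.
  //
  // Adding a second, implicit parameter list changes the erased signature,
  // so this variant can coexist with the name-based one.
  def freqItems(cols: Seq[Column])(implicit d: DummyImplicit): String =
    "by column: " + cols.map(_.name).mkString(", ")
}
```

With these definitions, `FreqItemsSketch.freqItems(Seq("a", "b"))` resolves to the name-based method and `FreqItemsSketch.freqItems(Seq(Column("a"), Column("b")))` resolves to the `Column`-based one. Whether such a workaround would be acceptable in a public API (for example, with respect to Java callers) is a separate question from the one raised in the PR.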