hagerf commented on issue #26087: [SPARK-29427][SQL] Create KeyValueGroupedDataset from existing columns in DataFrame
URL: https://github.com/apache/spark/pull/26087#issuecomment-553407464

Ok, I understand your point, and you're right that what I probably want is an API change (or rather, an extension). Let me explain why.

The method `groupByKey` signals that you want to create a structure grouped by some key, i.e. you want a `KeyValueGroupedDataset`. If you use `groupBy`, you want a `RelationalGroupedDataset`, and so on.

My issue is that currently the only way for a user to get hold of a `KeyValueGroupedDataset` is the method `groupByKey`, which takes a function `func: T => K`. Why shouldn't I be able to pass just some columns, or column names, and get a `KeyValueGroupedDataset`? I'm betting the majority of use cases are just key-grouping by some columns, as in almost all the cases I've encountered professionally. Why limit it to a function that disregards the existing partitioning and always shuffles?

Adding this method (I would prefer calling it `groupByKey`, but taking `Seq[String]`) would naturally extend the API to cover a common use case and, at the same time, may prevent expensive shuffling.

I have seen many people googling "cogroup for dataframes" (see the example links below), because the API is currently a bit unclear. If I wanted to do `cogroup` for DataFrames, of course I could do `.groupByKey(_.getLong(3))` or something similar, but with these changes I could instead just write `.groupByKey("id")` and automatically avoid any extra shuffling if the DF was already partitioned by id.

https://stackoverflow.com/questions/36513574/cogroup-on-spark-dataframes
http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-dataframe-cogroup-td26463.html

@HyukjinKwon I don't really understand how the code you wrote would work as a workaround. The constructor for `KeyValueGroupedDataset` is private, along with the encoders etc. If you have some workaround that I could use to get it working, that would be awesome.
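
For reference, here is a minimal sketch of the function-based route described above, i.e. doing `cogroup` on two DataFrames by going through `groupByKey(func)`. The object name, the sample data, and the column names (`id`, `item`, `name`) are made up for illustration, and it assumes a local Spark session is available. The column-based `groupByKey("id")` appears only in a comment because it is the proposal under discussion, not an existing method.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object CogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cogroup-sketch")
      .getOrCreate()
    import spark.implicits._ // encoders for Long and tuples

    val orders = Seq((1L, "book"), (1L, "pen"), (2L, "lamp")).toDF("id", "item")
    val users  = Seq((1L, "ann"), (3L, "bob")).toDF("id", "name")

    // Current API: the key comes from an opaque function Row => Long,
    // so the key is addressed by position rather than by column name.
    val groupedOrders = orders.groupByKey(_.getLong(0))
    val groupedUsers  = users.groupByKey(_.getLong(0))

    // Proposed in this PR (not an existing method at the time of writing):
    //   orders.groupByKey("id")

    // cogroup hands each key the rows from both sides as iterators.
    val counts = groupedOrders.cogroup(groupedUsers) { (id, os, us) =>
      Iterator((id, os.size, us.size))
    }

    counts.toDF("id", "orders", "users").show()
    spark.stop()
  }
}
```

Comparing `explain()` output for this version against a DataFrame that was first repartitioned by `id` is one way to check where an exchange is planned.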
