hagerf edited a comment on issue #26087: [SPARK-29427][SQL] Create KeyValueGroupedDataset from existing columns in DataFrame
URL: https://github.com/apache/spark/pull/26087#issuecomment-552827385

I don't understand why this would be considered a corner case. I've seen multiple people requesting similar features online. When working with huge data sources, regular joins are often simply not performant enough, so we are forced to use cogroup on `KeyValueGroupedDataset` (or RDDs). But the only way to create one is the inefficient `groupByKey` with a "Catalyst-invisible" function, which carries a significant performance hit.

@HyukjinKwon What about renaming the method to `groupByKey`? That way we don't introduce a new API or function name, but can still use the power of `KeyValueGroupedDataset` without the unnecessary shuffling.

@viirya I have added a PR with this renaming, to make it easy.

Again, I want to say that this is a real use case and I'm wasting resources daily because of this, so it would be very helpful.

Thanks a lot everyone 🙂
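For context, here is a minimal sketch of the pattern the comment describes: building two `KeyValueGroupedDataset`s with the existing function-based `groupByKey` and combining them with `cogroup`. The case classes `Order`, `Visit`, and `Summary`, the sample data, and the local `SparkSession` are all assumptions made for illustration; none of them come from the PR. The column-based variant proposed in the PR is not shown, since its final signature is not stated here.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, used only for illustration.
case class Order(customerId: Long, amount: Double)
case class Visit(customerId: Long, page: String)
case class Summary(customerId: Long, total: Double, visits: Int)

object CogroupSketch {
  def main(args: Array[String]): Unit = {
    // Local session for the sketch; a real job would configure this differently.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cogroup-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders: Dataset[Order] =
      Seq(Order(1L, 10.0), Order(1L, 5.0), Order(2L, 7.5)).toDS()
    val visits: Dataset[Visit] =
      Seq(Visit(1L, "/home"), Visit(3L, "/cart")).toDS()

    // The key is extracted by a Scala lambda. Catalyst cannot look inside
    // the lambda, so it does not know the key is simply the customerId column.
    val ordersByCustomer = orders.groupByKey(_.customerId)
    val visitsByCustomer = visits.groupByKey(_.customerId)

    // cogroup hands the function both iterators for each key; either
    // iterator may be empty when the key appears on one side only.
    val summaries: Dataset[Summary] =
      ordersByCustomer.cogroup(visitsByCustomer) { (id, os, vs) =>
        Iterator(Summary(id, os.map(_.amount).sum, vs.size))
      }

    summaries.show()
    spark.stop()
  }
}
```

Because the grouping key here comes from an opaque function rather than a column expression, the optimizer cannot relate it to any existing partitioning of the data, which is the inefficiency the comment refers to.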
