[GitHub] [spark] hagerf edited a comment on issue #26087: [SPARK-29427][SQL] Create KeyValueGroupedDataset from existing columns in DataFrame

GitBox Tue, 12 Nov 2019 02:15:57 -0800

hagerf edited a comment on issue #26087: [SPARK-29427][SQL] Create 
KeyValueGroupedDataset from existing columns in DataFrame
URL: https://github.com/apache/spark/pull/26087#issuecomment-552827385
 
 
   I don't understand why this would be considered a corner case. I've seen 
multiple people requesting similar features online. When working with huge data 
sources, regular joins are often simply not performant enough, so we are forced 
to use cogroup on `KeyValueGroupedDataset` (or RDDs). But the only way to 
create one is the inefficient `groupByKey` with a "catalyst-invisible" 
function, which carries a significant performance hit.
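
   For illustration, here is a minimal sketch of the pattern described above, 
assuming a local `SparkSession`. The case classes `Order` and `Payment`, their 
fields, and the object name `CogroupSketch` are hypothetical and not taken from 
the PR. The key is produced by a Scala closure passed to `groupByKey`, which is 
the "catalyst-invisible" function mentioned above: the optimizer cannot tell 
that the closure only reads an existing column.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, used only for this sketch.
case class Order(customerId: Long, amount: Double)
case class Payment(customerId: Long, paid: Double)

object CogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cogroup-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders: Dataset[Order] =
      Seq(Order(1L, 10.0), Order(1L, 5.0), Order(2L, 7.0)).toDS()
    val payments: Dataset[Payment] =
      Seq(Payment(1L, 15.0), Payment(3L, 2.0)).toDS()

    // groupByKey takes an arbitrary Scala function. Catalyst treats it as
    // opaque, even though it only reads the existing customerId column.
    val ordersByCustomer = orders.groupByKey(_.customerId)
    val paymentsByCustomer = payments.groupByKey(_.customerId)

    // cogroup hands the function both groups that share a key.
    val balances = ordersByCustomer.cogroup(paymentsByCustomer) {
      (id, os, ps) =>
        Iterator((id, os.map(_.amount).sum - ps.map(_.paid).sum))
    }

    balances.show()
    spark.stop()
  }
}
```

   Comparing `balances.explain()` with the plan of a column-based 
`groupBy("customerId")` is one way to see how differently the two are planned.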
   
   @HyukjinKwon What about renaming the method to `groupByKey` so that we don't 
have a new function name for it? That way we don't add a new API, but can 
still use the power of `KeyValueGroupedDataset` without having to compromise 
with unnecessary shuffling.
   @viirya I have added a PR with this renaming, to make it easy.
   
   Again, I want to say that this is a real use case, and I'm wasting 
resources on it daily, so this would be very helpful.
   Thanks a lot everyone  🙂



