[GitHub] [spark] hagerf commented on issue #26087: [SPARK-29427][SQL] Create KeyValueGroupedDataset from existing columns in DataFrame

GitBox Wed, 13 Nov 2019 05:38:04 -0800

hagerf commented on issue #26087: [SPARK-29427][SQL] Create 
KeyValueGroupedDataset from existing columns in DataFrame
URL: https://github.com/apache/spark/pull/26087#issuecomment-553407464
 
 
   Ok, I understand your point, and you're right that what I probably want is 
an API change (or rather an extension). Let me explain why. 
   
   The method `groupByKey` signals that you want to create a structure that 
is grouped by some key, i.e. you want to get a `KeyValueGroupedDataset`. If you 
use `groupBy`, you want a `RelationalGroupedDataset`, and so on. My issue is 
that currently the only way for a user to get their hands on a 
`KeyValueGroupedDataset` is the method `groupByKey`, which takes a 
function `func: T => K`. Why shouldn't I be able to pass just some columns, or 
column names, and get a `KeyValueGroupedDataset`? I'm betting the majority of 
use cases are just key-grouping by some columns, as in almost all the cases 
I've encountered professionally. Why limit it to a function that pays no 
regard to the existing partitioning and always shuffles? Adding this method (I 
would prefer calling it `groupByKey` but taking `Seq[String]`) would naturally 
extend the API to cover a common use case while potentially avoiding an 
expensive shuffle.
   
   I have seen many people googling "cogroup for dataframes" (see the example 
links below) because the API is currently a bit unclear. If I wanted to do 
`cogroup` for dataframes, I could of course do `.groupByKey(_.getLong(3))` or 
something similar, but with these changes I could instead just write 
`.groupByKey("id")` and automatically avoid any extra shuffling if the DF 
was already partitioned by id.
   
   https://stackoverflow.com/questions/36513574/cogroup-on-spark-dataframes
   
http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-dataframe-cogroup-td26463.html
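
   For reference, here is a minimal sketch of how `cogroup` on two dataframes 
can be written today with the function-based `groupByKey`. The sample data, the 
column names (`id`, `value`, `other`) and the object name `CogroupSketch` are 
made up for illustration. The column-based `.groupByKey("id")` is only the 
extension proposed in this PR, so it appears as a comment and not as a call.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical example object; assumes Spark SQL is on the classpath.
object CogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cogroup-sketch")
      .getOrCreate()
    import spark.implicits._ // encoders for Long and tuples

    // Illustrative data, not taken from the PR.
    val left = Seq((1L, "a"), (1L, "b"), (2L, "c")).toDF("id", "value")
    val right = Seq((1L, "x"), (3L, "y")).toDF("id", "other")

    // Existing API: the key is extracted by an opaque function over Row.
    // With the proposed extension this would read left.groupByKey("id").
    val leftGrouped = left.groupByKey((row: Row) => row.getAs[Long]("id"))
    val rightGrouped = right.groupByKey((row: Row) => row.getAs[Long]("id"))

    // Emit one row per key with the number of rows seen on each side.
    val counts = leftGrouped.cogroup(rightGrouped) { (id, ls, rs) =>
      Iterator((id, ls.size, rs.size))
    }

    counts.show()
    spark.stop()
  }
}
```

   Because the key here comes from a function rather than from named columns, 
the grouping cannot be expressed in terms of `id` directly, which is the gap 
the column-based variant is meant to close.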
   
   @HyukjinKwon I don't really understand how the code you wrote would work as 
a workaround. The constructor for `KeyValueGroupedDataset` is private, along 
with the encoders etc. If you have a workaround that I could use to get it 
working, that would be awesome.


