Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
okay i see the partition local sort. got it. i would expect that pushing the partition local sort into shuffle would give a signficicant boost. but thats just a guess. On Fri, Nov 4, 2016 at 2:39 PM, Michael Armbrust wrote: > sure, but then my values are not sorted per

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Michael Armbrust
> > sure, but then my values are not sorted per key, right? It does do a partition local sort. Look at the query plan in my example .

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
i just noticed Sort for Dataset has a global flag. and Dataset also has sortWithinPartitions. how about: repartition + sortWithinPartitions + mapPartitions? the plan looks ok, but it is not clear to me if the sort is done as part of the shuffle (which is the important optimization). scala> val

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
sure, but then my values are not sorted per key, right? so a group by key with values sorted according to to some ordering is an operation that can be done efficiently in a single shuffle without first figuring out range boundaries. and it is needed for quite a few algos, including Window and

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
Thinking out loud is good :) You are right in that anytime you ask for a global ordering from Spark you will pay the cost of figuring out the range boundaries for partitions. If you say orderBy, though, we aren't sure that you aren't expecting a global order. If you only want to make sure that

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then do mapPartitions? sorry thinking out loud a bit here. ok i think that could work. thanks On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers wrote: > thats an interesting thought about orderBy and

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
thats an interesting thought about orderBy and mapPartitions. i guess i could emulate a groupBy with secondary sort using those two. however isn't using an orderBy expensive since it is a total sort? i mean a groupBy with secondary sort is also a total sort under the hood, but its on

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
> > It is still unclear to me why we should remember all these tricks (or add > lots of extra little functions) when this elegantly can be expressed in a > reduce operation with a simple one line lamba function. > I think you can do that too. KeyValueGroupedDataset has a reduceGroups function.

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Oh okay that makes sense. The trick is to take max on tuple2 so you carry the other column along. It is still unclear to me why we should remember all these tricks (or add lots of extra little functions) when this elegantly can be expressed in a reduce operation with a simple one line lamba

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example . On Thu, Nov 3, 2016 at 4:53

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you test it can be dangerous if there is nothing in the api guarantee. Going back quite a few years it used to be the case that Oracle would always return a group by with the rows in the order of the grouping key. This

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i did not check the claim in that blog post that the data is ordered, but i wouldnt rely on that behavior since it is not something the api guarantees and could change in future versions On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee wrote: > Hi Koert & Robin , >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread ayan guha
I would go for partition by option. It seems simple and yes, SQL inspired :) On 4 Nov 2016 00:59, "Rabin Banerjee" wrote: > Hi Koert & Robin , > > * Thanks ! *But if you go through the blog https://bzhangusc. >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi Koert & Robin , * Thanks ! *But if you go through the blog https://bzhangusc.wordpress.co m/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ and check the comments under the blog it's actually working, although I am not sure how . And yes I agree a custom aggregate UDAF is a good

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Just realized you only want to keep first element. You can do this without sorting by doing something similar to min or max operation using a custom aggregator/udaf or reduceGroups on Dataset. This is also more efficient. On Nov 3, 2016 7:53 AM, "Rabin Banerjee"

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
What you require is secondary sort which is not available as such for a DataFrame. The Window operator is what comes closest but it is strangely limited in its abilities (probably because it was inspired by a SQL construct instead of a more generic programmatic transformation capability). On Nov

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever the implementation details or the observed behaviour. I would use a Window operation and order within the group. > On 3 Nov 2016, at 11:53, Rabin Banerjee wrote: > > Hi All , > >

Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi All , I want to do a dataframe operation to find the rows having the latest timestamp in each group using the below operation