Do you mean you'd like to partition the data by a specific key?

If we issue a CLUSTER BY/repartition, the operation that follows needn't
shuffle; it's effectively the same as a foreachPartition, I think.
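
For instance, a minimal sketch (the Dataset name df and the columns key/value
are just placeholders, and it assumes a SparkSession in scope as spark):

    import spark.implicits._

    // hypothetical input: a DataFrame with columns "key" and "value"
    val df = Seq(("a", 1L), ("b", 2L), ("a", 3L)).toDF("key", "value")

    // repartition by the key column (same effect as CLUSTER BY key in SQL);
    // the mapPartitions that follows runs on the already-clustered
    // partitions without a further shuffle
    val result = df
      .repartition($"key")
      .mapPartitions { rows =>
        // each iterator holds all rows hashed to this partition
        rows.map(r => (r.getString(0), r.getLong(1) * 2))
      }
      .toDF("key", "doubled")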

Or we could always get the underlying RDD from the Dataset and translate the
SQL operation into a function...
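
A rough sketch along those lines (again using the placeholder df from above):

    // underlying RDD[Row] of the Dataset
    val rdd = df.rdd

    rdd
      .map(r => (r.getString(0), r.getLong(1)))  // the SQL projection rewritten as a function
      .foreachPartition { it =>
        // per-partition work, e.g. one connection / one batch per partition
        it.foreach { case (k, v) => println(s"$k -> $v") }
      }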

On Mon, Jun 26, 2017 at 10:24 AM, Stephen Boesch <java...@gmail.com> wrote:

> Spark SQL did not support explicit partitioners even before Tungsten, and
> often enough this did hurt performance. Even now Tungsten will not do the
> best job every time, so the question from the OP is still germane.
>
> 2017-06-25 19:18 GMT-07:00 Ryan <ryan.hd....@gmail.com>:
>
>> Why would you like to do so? I think there's no need for us to explicitly
>> ask for a foreachPartition in Spark SQL because Tungsten is smart enough to
>> figure out whether a SQL operation can be applied on each partition or
>> whether a shuffle is required.
>>
>> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi <jeffsar...@hotmail.com>
>> wrote:
>>
>>> You can do a map() using a select and functions/UDFs. But how do you
>>> process a partition using SQL?
>>>
>>>
>>>
>>
>
