Do you mean you'd like to partition the data by a specific key? If we issue a cluster by/repartition, a following operation needn't shuffle; it's effectively the same as a forEachPartition, I think.
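A minimal sketch of what I mean (names and data are illustrative, and this assumes Tungsten keeps the partitioning for the narrow operations that follow):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical key/value data.
    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // Repartition by the key column: all rows for a key land in the
    // same partition, so following narrow operations need no shuffle.
    val byKey = ds.repartition($"_1")

    // Per-partition processing directly on the Dataset...
    val perPartitionCounts = byKey.mapPartitions(it => Iterator(it.size))
    perPartitionCounts.show()

    // ...or drop to the underlying RDD for foreachPartition.
    byKey.rdd.foreachPartition { it =>
      it.foreach { case (k, v) => /* process each row */ () }
    }

    spark.stop()
  }
}
```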
Or we could always get the underlying RDD from the Dataset, translating the SQL operation into a function...

On Mon, Jun 26, 2017 at 10:24 AM, Stephen Boesch <java...@gmail.com> wrote:

> Spark SQL did not support explicit partitioners even before Tungsten, and
> often enough this did hurt performance. Even now Tungsten will not do the
> best job every time, so the question from the OP is still germane.
>
> 2017-06-25 19:18 GMT-07:00 Ryan <ryan.hd....@gmail.com>:
>
>> Why would you like to do so? I think there's no need for us to explicitly
>> ask for a forEachPartition in Spark SQL, because Tungsten is smart enough
>> to figure out whether a SQL operation can be applied on each partition or
>> whether there has to be a shuffle.
>>
>> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi <jeffsar...@hotmail.com>
>> wrote:
>>
>>> You can do a map() using a select and functions/UDFs. But how do you
>>> process a partition using SQL?