Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-28 Thread jeff saremi
I have to read up on the writer. But would the writer get records back from somewhere? I want to do a bulk operation and continue with the results in the form of a dataframe. Currently the UDF does this: 1 scalar -> 1 scalar the UDAF does this: M records -> 1 scalar I want this: M records ->

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically you have a big table of data distributed across a bunch of executors. And, you want an efficient way to call a native method for each row. It sounds similar to a dataframe writer to me. Except, instead of writing to disk or

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
ok.. for plain sql, I've no idea other than defining a udaf On Mon, Jun 26, 2017 at 10:59 AM, jeff saremi wrote: > My specific and immediate need is this: We have a native function wrapped > in JNI. To increase performance we'd like to avoid calling it record by >

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread jeff saremi
My specific and immediate need is this: We have a native function wrapped in JNI. To increase performance we'd like to avoid calling it record by record. mapPartitions() give us the ability to invoke this in bulk. We're looking for a similar approach in SQL.

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Do you mean you'd like to partition the data with specific key? If we issue a cluster by/repartition, following an operation needn't shuffle, it's effectively the same as for each partition I think. Or we could always get the underlying rdd from dataset, translating sql operation to function...

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before tungsten: and often enough this did hurt performance. Even now Tungsten will not do the best job every time: so the question from the OP is still germane. 2017-06-25 19:18 GMT-07:00 Ryan : > Why would you like to

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly ask for a forEachPartition in spark sql because tungsten is smart enough to figure out whether a sql operation could be applied on each partition or there has to be a shuffle. On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi