I have to read up on the writer. But would the writer get records back from
somewhere? I want to do a bulk operation and continue with the results in the
form of a dataframe.
Currently the UDF does this: 1 scalar -> 1 scalar
the UDAF does this: M records -> 1 scalar
I want this: M records -> N records (with the results coming back as a dataframe)
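The M-records-in, records-out shape being asked for can be sketched in plain Python (no Spark dependency). `native_batch` and `map_partition` below are hypothetical stand-ins for the JNI-wrapped call and the partition-level hook; they are not Spark APIs, just an illustration of the iterator-in/iterator-out contract:

```python
# Sketch of the desired bulk shape: one call over a whole batch of
# records instead of one call per record. `native_batch` is a stand-in
# for the JNI-wrapped native function (here it just doubles values).

def native_batch(rows):
    # Single bulk invocation over the whole batch.
    return [r * 2 for r in rows]

def map_partition(partition):
    # mapPartitions-style contract: iterator in, iterator out.
    batch = list(partition)            # materialize the partition
    return iter(native_batch(batch))   # one bulk call, results streamed back

result = list(map_partition(iter([1, 2, 3])))
print(result)  # [2, 4, 6]
```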
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors. And,
you want an efficient way to call a native method for each row.
It sounds similar to a dataframe writer to me, except that instead of writing
to disk or some external sink, the records would come back to you.
OK, for plain SQL I've no idea other than defining a UDAF.
On Mon, Jun 26, 2017 at 10:59 AM, jeff saremi wrote:
My specific and immediate need is this: we have a native function wrapped in
JNI. To increase performance we'd like to avoid calling it record by record.
mapPartitions() gives us the ability to invoke it in bulk. We're looking for a
similar approach in SQL.
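The mapPartitions approach described above can be sketched in plain Python (itertools only, no live Spark session). `call_native_bulk` is a stand-in for the JNI-wrapped function; the point of the sketch is that it is invoked once per chunk, not once per record:

```python
# Batch a partition's records into chunks and make one native call per
# chunk, mimicking what mapPartitions() enables in Spark.

from itertools import islice

def chunks(it, size):
    # Yield successive fixed-size blocks from an iterator.
    it = iter(it)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

calls = []  # records how many rows each bulk call received

def call_native_bulk(batch):
    # Stand-in for the JNI-wrapped native function.
    calls.append(len(batch))
    return [x + 1 for x in batch]

def process_partition(partition, batch_size=4):
    for block in chunks(partition, batch_size):
        yield from call_native_bulk(block)

out = list(process_partition(range(10)))
print(calls)  # [4, 4, 2] -- three bulk calls instead of ten per-record calls
```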
Do you mean you'd like to partition the data by a specific key?
If we issue a CLUSTER BY/repartition, the following operation needn't
shuffle; it's effectively the same as a forEachPartition, I think.
Or we could always get the underlying RDD from the Dataset, translating the
SQL operation into a function...
Spark SQL did not support explicit partitioners even before Tungsten, and
often enough this did hurt performance. Even now Tungsten will not do the
best job every time, so the question from the OP is still germane.
2017-06-25 19:18 GMT-07:00 Ryan:
Why would you like to do so? I think there's no need to explicitly ask for a
forEachPartition in Spark SQL, because Tungsten is smart enough to figure out
whether a SQL operation can be applied on each partition or whether there has
to be a shuffle.
On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi wrote: