Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-28 Thread jeff saremi
I have to read up on the writer. But would the writer get records back from 
somewhere? I want to do a bulk operation and continue with the results in the 
form of a dataframe.

Currently the UDF does this: 1 scalar -> 1 scalar

The UDAF does this: M records -> 1 scalar

I want this: M records -> M records (or M scalars),
or in the broadest sense: M records -> N records.

I think this capability has been left out of Spark SQL, forcing us to go back to
Spark Core and use map*, groupBy*, reduceBy* functions and the like.

Being forced to keep converting between SQL and non-SQL is very annoying, so we
end up staying conservative and just making do without SQL. I'm sure we're not
alone here.
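For context, the iterator-in/iterator-out contract that makes M -> N possible is exactly what mapPartitions exposes in the RDD/Dataset API. A minimal sketch of that contract in plain Python (no Spark; the function and data are illustrative):

```python
from typing import Iterator

def map_partition(records: Iterator[int]) -> Iterator[int]:
    """M records in -> N records out: drop odd values, emit each even twice."""
    for r in records:
        if r % 2 == 0:
            yield r
            yield r

# 4 records in, 4 records out here -- but the counts are independent:
# the function controls how many records it emits per input.
partition = iter([1, 2, 3, 4])
result = list(map_partition(partition))
print(result)  # [2, 2, 4, 4]
```

Nothing in the SQL expression layer (UDF: 1 -> 1, UDAF: M -> 1) lets a function see the whole iterator like this, which is the gap being discussed.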



From: Aaron Perrin 
Sent: Tuesday, June 27, 2017 4:50:25 PM
To: Ryan; jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SpqrkSQL?

I'm assuming some things here, but hopefully I understand. So, basically you 
have a big table of data distributed across a bunch of executors. And, you want 
an efficient way to call a native method for each row.

It sounds similar to a dataframe writer to me. Except, instead of writing to 
disk or network, you're 'writing' to a native function. Would a custom 
dataframe writer work? That's what I'd try first.

If that doesn't work for your case, you could also try adding a column where 
the column function does the native call. However, if doing it that way, you'd 
have to ensure that the column function actually gets called for all rows. (An
interesting side effect is that you could catch JNI/WinAPI errors there and
set the column value to the result.)

There are other ways, too, if those options don't work...

On Sun, Jun 25, 2017 at 8:07 PM, jeff saremi wrote:

My specific and immediate need is this: We have a native function wrapped in 
JNI. To increase performance we'd like to avoid calling it record by record. 
mapPartitions() gives us the ability to invoke this in bulk. We're looking for a 
similar approach in SQL.
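The performance argument here is about amortizing the fixed cost of each JNI crossing over a batch of records. A sketch of the idea in plain Python (no Spark or JNI; `native_bulk` is a hypothetical stand-in for the wrapped native function):

```python
call_count = 0  # counts crossings into the "native" layer

def native_bulk(batch):
    """Pretend JNI entry point: one crossing per batch, not per record."""
    global call_count
    call_count += 1
    return [x * 10 for x in batch]

def process_partition(records, batch_size=3):
    """Accumulate records and hand them to the native call in chunks."""
    batch, out = [], []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            out.extend(native_bulk(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.extend(native_bulk(batch))
    return out

out = process_partition(range(7))
print(call_count, out)  # 3 crossings for 7 records, instead of 7
```

With a per-record UDF the crossing count equals the record count; with per-partition batching it drops to roughly records/batch_size, which is the whole point of wanting a mapPartitions equivalent.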



From: Ryan
Sent: Sunday, June 25, 2017 7:18:32 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SpqrkSQL?

Why would you like to do so? I think there's no need for us to explicitly ask 
for a forEachPartition in spark sql because tungsten is smart enough to figure 
out whether a sql operation could be applied on each partition or there has to 
be a shuffle.

On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi wrote:

You can do a map() using a select and functions/UDFs. But how do you process a 
partition using SQL?




Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors. And,
you want an efficient way to call a native method for each row.

It sounds similar to a dataframe writer to me. Except, instead of writing
to disk or network, you're 'writing' to a native function. Would a custom
dataframe writer work? That's what I'd try first.

If that doesn't work for your case, you could also try adding a column
where the column function does the native call. However, if doing it that
way, you'd have to ensure that the column function actually gets called for
all rows. (An interesting side effect is that you could catch JNI/WinAPI
errors there and set the column value to the result.)

There are other ways, too, if those options don't work...
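The "column function that captures errors" idea can be sketched in plain Python: a per-row wrapper calls the native function and folds any failure into the column value rather than failing the job. (`native_call`, the row values, and the status tuples are all illustrative, not a real API.)

```python
def native_call(x):
    """Stand-in for the JNI/WinAPI call; fails on some inputs."""
    if x < 0:
        raise ValueError("native error")
    return x * 2

def column_fn(x):
    """Per-row column function: capture the error instead of raising."""
    try:
        return ("ok", native_call(x))
    except ValueError as e:
        return ("err", str(e))

rows = [3, -1, 5]
new_column = [column_fn(x) for x in rows]
print(new_column)  # [('ok', 6), ('err', 'native error'), ('ok', 10)]
```

Note this is still a 1 -> 1, record-by-record shape: it solves error capture but not the bulk-invocation problem raised earlier in the thread.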

On Sun, Jun 25, 2017 at 8:07 PM, jeff saremi wrote:

> My specific and immediate need is this: We have a native function wrapped
> in JNI. To increase performance we'd like to avoid calling it record by
> record. mapPartitions() gives us the ability to invoke this in bulk. We're
> looking for a similar approach in SQL.
>
>
> --
> *From:* Ryan 
> *Sent:* Sunday, June 25, 2017 7:18:32 PM
> *To:* jeff saremi
> *Cc:* user@spark.apache.org
> *Subject:* Re: What is the equivalent of mapPartitions in SpqrkSQL?
>
> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
OK... for plain SQL, I've no idea other than defining a UDAF.
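For reference, a UDAF is inherently M -> 1: its buffer folds many rows into one value per group, so it can't emit M (or N) rows. A minimal sketch of the update/merge/evaluate contract in plain Python (the class and data are illustrative, not Spark's actual interface):

```python
class SumUdaf:
    """Toy aggregator mirroring the UDAF lifecycle: many rows in, one value out."""
    def __init__(self):
        self.buffer = 0                  # initialize the aggregation buffer

    def update(self, value):
        self.buffer += value             # fold one input row into the buffer

    def merge(self, other):
        self.buffer += other.buffer      # combine partial aggregates (e.g. per partition)

    def evaluate(self):
        return self.buffer               # single scalar result

a, b = SumUdaf(), SumUdaf()              # two partial aggregates
for v in [1, 2]:
    a.update(v)
for v in [3, 4]:
    b.update(v)
a.merge(b)
print(a.evaluate())  # 10
```

The shape of `evaluate` (one value) is exactly why a UDAF can't serve as the M -> N operation the original poster wants.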



On Mon, Jun 26, 2017 at 10:59 AM, jeff saremi 
wrote:

> My specific and immediate need is this: We have a native function wrapped
> in JNI. To increase performance we'd like to avoid calling it record by
> record. mapPartitions() gives us the ability to invoke this in bulk. We're
> looking for a similar approach in SQL.
>
>
> --
> *From:* Ryan 
> *Sent:* Sunday, June 25, 2017 7:18:32 PM
> *To:* jeff saremi
> *Cc:* user@spark.apache.org
> *Subject:* Re: What is the equivalent of mapPartitions in SpqrkSQL?
>
> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread jeff saremi
My specific and immediate need is this: We have a native function wrapped in 
JNI. To increase performance we'd like to avoid calling it record by record. 
mapPartitions() gives us the ability to invoke this in bulk. We're looking for a 
similar approach in SQL.



From: Ryan 
Sent: Sunday, June 25, 2017 7:18:32 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SpqrkSQL?

Why would you like to do so? I think there's no need for us to explicitly ask 
for a forEachPartition in spark sql because tungsten is smart enough to figure 
out whether a sql operation could be applied on each partition or there has to 
be a shuffle.

On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi wrote:

You can do a map() using a select and functions/UDFs. But how do you process a 
partition using SQL?




Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Do you mean you'd like to partition the data by a specific key?

If we issue a CLUSTER BY/repartition, the following operation needn't
shuffle; it's effectively the same as a per-partition operation, I think.

Or we could always get the underlying RDD from the Dataset, translating the
SQL operation into a function...
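The "cluster by, then process per partition" point can be sketched in plain Python: once rows are hash-partitioned by key, all rows for a key live in one partition, so a subsequent per-key operation needs no further data movement. (No Spark here; the partitioner and data are illustrative.)

```python
from collections import defaultdict

def repartition_by_key(rows, num_partitions):
    """Hash-partition (key, value) rows, like CLUSTER BY / repartition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
parts = repartition_by_key(rows, 2)

# Per-key aggregation now runs independently inside each partition:
# no cross-partition traffic (i.e., no shuffle) is needed afterwards.
totals = {}
for part in parts:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    totals.update(local)
print(dict(sorted(totals.items())))  # {'a': 4, 'b': 6}
```

This is the property the optimizer exploits when it elides a shuffle after a clustering operation, though as noted downthread it doesn't always get it right.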

On Mon, Jun 26, 2017 at 10:24 AM, Stephen Boesch wrote:

> Spark SQL did not support explicit partitioners even before tungsten: and
> often enough this did hurt performance.  Even now Tungsten will not do the
> best job every time: so the question from the OP is still germane.
>
> 2017-06-25 19:18 GMT-07:00 Ryan:
>
>> Why would you like to do so? I think there's no need for us to explicitly
>> ask for a forEachPartition in spark sql because tungsten is smart enough to
>> figure out whether a sql operation could be applied on each partition or
>> there has to be a shuffle.
>>
>> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
>> wrote:
>>
>>> You can do a map() using a select and functions/UDFs. But how do you
>>> process a partition using SQL?
>>>
>>>
>>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before tungsten: and
often enough this did hurt performance.  Even now Tungsten will not do the
best job every time: so the question from the OP is still germane.

2017-06-25 19:18 GMT-07:00 Ryan:

> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly
ask for a forEachPartition in spark sql because tungsten is smart enough to
figure out whether a sql operation could be applied on each partition or
there has to be a shuffle.

On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
wrote:

> You can do a map() using a select and functions/UDFs. But how do you
> process a partition using SQL?
>
>
>