[ https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771507#comment-16771507 ]

Hyukjin Kwon commented on SPARK-26858:
--------------------------------------

{quote}
If I understand, this is the case where Spark actually doesn't care much about 
the schema but sounds like Arrow does.
{quote}

Yes, correct. Spark uses the {{binary}} type as a container to ship the R data 
frame, but Arrow requires a {{Schema}} to be set.

{quote}
Could we infer the schema from R data.frame?
{quote}

That's possible. I think this is virtually the same as getting the schema from 
the first Arrow batch. The problem is that we don't know the output's schema 
before actually executing the given R native function.
So, to use this approach, we would need to:

1. Somehow execute the R native function only once, and get the schema from its 
output (the R data.frame or the Arrow batch) - roughly as sketched in the 
Python example below.
2. Because the current Arrow code path in Apache Spark always uses Spark's 
schema ahead of time, add some code to extract the schema from the first batch 
(or R data.frame).

I tried this approach and it looked pretty hacky. It requires sending the 
schema back from the executor to the driver (because the R native function is 
executed on the executor side).
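
For illustration, here is a rough Python analogue of step 1 using {{pyarrow}} 
(a sketch only; {{user_func}} is a hypothetical stand-in for the R native 
function, not part of any Spark API): run the function once on the first group, 
then take the Arrow schema from its output.

{code}
import pandas as pd
import pyarrow as pa

# Hypothetical stand-in for the R native function: Spark cannot know
# its output columns until it has actually run.
def user_func(pdf):
    return pdf.assign(doubled=pdf["v"] * 2)

# Execute once on the first group, then infer the schema from the result.
first_group = pd.DataFrame({"v": [1.0, 2.0]})
first_result = user_func(first_group)

# This inferred schema would then have to be sent from the executor
# back to the driver, which is the hacky part mentioned above.
schema = pa.Schema.from_pandas(first_result)
print(schema)
{code}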

{quote}
Is there an equivalent way for Python Pandas to Arrow?
{quote}

One way on the Python side is the {{pyarrow.Table.from_batches}} API, which 
allows creating an Arrow table from batches directly, meaning from:

{code}
|-------------|
| Arrow Batch |
|-------------|
| Arrow Batch |
|-------------|
   ...
{code}
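
For example, here is a minimal {{pyarrow}} sketch (not Spark code) of building 
a table from bare batches; when no schema argument is passed, 
{{from_batches}} takes the schema from the first batch:

{code}
import pandas as pd
import pyarrow as pa

# Two Arrow batches with no separately declared schema.
batch1 = pa.RecordBatch.from_pandas(pd.DataFrame({"id": [1, 2], "v": [0.1, 0.2]}))
batch2 = pa.RecordBatch.from_pandas(pd.DataFrame({"id": [3, 4], "v": [0.3, 0.4]}))

# Without an explicit schema argument, the schema of the first batch is used.
table = pa.Table.from_batches([batch1, batch2])
print(table.schema)
{code}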

In this case, we don't need to know the {{Schema}} ahead of time, but the R 
side of Arrow does not seem to have this API. If the R side had it, the 
workaround might be possible.
However, the protocol would still be different, since we're basically not able 
to send a complete Arrow streaming format:

{code}
| Schema      |
|-------------|
| Arrow Batch |
|-------------|
| Arrow Batch |
|-------------|
   ...
{code}
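
For contrast, a small {{pyarrow}} sketch of this streaming format (again just 
an illustration): the stream writer takes the {{Schema}} up front, before any 
batch can be written.

{code}
import pandas as pd
import pyarrow as pa

batch = pa.RecordBatch.from_pandas(pd.DataFrame({"id": [1, 2]}))

# The Schema is written first, then the batches follow.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# The reader gets the Schema back before reading any batch.
reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
print(reader.schema)
print(reader.read_all())
{code}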

If I remember correctly, the Python side in PySpark always complies with this 
format, and the R side in SparkR complies with it so far too.
In order to skip the {{Schema}}, a different API usage would be needed even if 
an API like {{pyarrow.Table.from_batches}} were available in the R Arrow API.

([~bryanc], correct me if I am wrong at any point.)

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26858
>                 URL: https://issues.apache.org/jira/browse/SPARK-26858
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR, SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before execution 
> actually happens.
> In the original code path, this is done by using a binary schema. Once gapply 
> is done (SPARK-26761), we can mimic this approach in vectorized gapply to 
> support gapplyCollect.


