[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26410:
----------------------------------
    Description: 
We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
"right" batch size usually depends on the task itself. It would be nice if users 
could configure the batch size when they declare a Pandas UDF.
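
For context, here is a minimal sketch of how the batch size is controlled today via the session-wide conf, together with a hypothetical per-UDF option. The batch_size keyword below does not exist; it only illustrates the kind of API being proposed:

{code}
from pyspark.sql.functions import pandas_udf

# Today: one session-wide knob shared by every Pandas UDF.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# Hypothetical: each UDF declares the batch size it wants.
# `batch_size` is not an existing parameter; it only sketches the idea.
@pandas_udf("double", batch_size=100)
def predict1(features):
    return features * 2.0

@pandas_udf("double", batch_size=120)
def predict2(features):
    return features + 1.0
{code}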

This is orthogonal to SPARK-23258 (using max buffer size instead of row count).

Besides the API, we should also discuss how to merge Pandas UDFs with different 
configurations. For example,

{code}
df.select(predict1(col("features")), predict2(col("features")))
{code}

where predict1 requests 100 rows per batch and predict2 requests 120 rows per 
batch.
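
Purely as an illustration of one possible merge policy (not a proposal in itself), the runtime could fall back to the smallest explicitly requested size when several Pandas UDFs end up sharing one Arrow batch:

{code}
# Illustrative sketch only: pick the smallest explicitly requested batch size;
# fall back to the session default when no UDF states a preference.
def merge_batch_sizes(requested_sizes, default_size=10000):
    explicit = [s for s in requested_sizes if s is not None]
    return min(explicit) if explicit else default_size

assert merge_batch_sizes([100, 120]) == 100    # predict1 and predict2 above
assert merge_batch_sizes([None, None]) == 10000
{code}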

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



> Support per Pandas UDF configuration
> ------------------------------------
>
>                 Key: SPARK-26410
>                 URL: https://issues.apache.org/jira/browse/SPARK-26410
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Priority: Major
>


