[
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-32294.
----------------------------------
Fix Version/s: 3.3.0
Resolution: Cannot Reproduce
This is fixed in the master branch. I can't reproduce it anymore:
{code}
>>> df = spark.range(1024 * 1024 * 1024 * 1).selectExpr("1 as a", "1 as b", "1 as c").coalesce(1)  # More than 2GB to hit ARROW-4890.
>>> df.groupby("a").applyInPandas(lambda pdf: pdf, schema=df.schema).count()
1073741824
{code}
> GroupedData Pandas UDF 2Gb limit
> --------------------------------
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Ruslan Dautkhanov
> Priority: Major
> Fix For: 3.3.0
>
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for
> GroupedData: the whole group is passed to the Pandas UDF at once, which can
> hit various 2 GB limitations on the Arrow side (and, in current versions of
> Arrow, also a 2 GB limitation on the Netty allocator side) -
> https://issues.apache.org/jira/browse/ARROW-4890
> It would be great to consider feeding GroupedData into the pandas UDF in
> batches to solve this issue (a workaround sketch follows below the quoted
> description).
> cc [~hyukjin.kwon]
>
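For Spark versions where GroupedData still materializes each group as a single Arrow batch, one possible workaround (not part of this ticket's resolution, and only valid when the per-group logic can be applied independently to sub-groups, e.g. row-wise transforms) is to salt oversized groups so each sub-group stays well under the 2 GB limit. A minimal sketch, assuming a DataFrame {{df}} like the one above; the salt count and the pass-through UDF are illustrative:
{code}
# Hypothetical workaround sketch: split oversized groups with a salt column so each
# Arrow batch handed to the pandas UDF stays well under the 2 GB limit.
import pandas as pd
from pyspark.sql import functions as F

N_SALT = 16  # assumption: 16 sub-groups per key keeps each batch small enough

def per_subgroup(pdf: pd.DataFrame) -> pd.DataFrame:
    # Row-wise logic only: each (a, salt) sub-group is processed independently.
    return pdf.drop(columns=["salt"])

salted = df.withColumn("salt", (F.rand() * N_SALT).cast("int"))
result = salted.groupby("a", "salt").applyInPandas(per_subgroup, schema=df.schema)
{code}
As the report notes, `spark.sql.execution.arrow.maxRecordsPerBatch` does not help on this path: it bounds batch sizes for other Arrow conversions, while the grouped path passes whole groups.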
--
This message was sent by Atlassian Jira
(v8.20.1#820001)