Richard Williamson created SPARK-37100:
------------------------------------------
Summary: Pandas groupby UDFs would benefit from automatically
redistributing data on the groupby key in order to prevent network issues
when running the UDF
Key: SPARK-37100
URL: https://issues.apache.org/jira/browse/SPARK-37100
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.1.2
Reporter: Richard Williamson
Fix For: 3.2.1
When running high-cardinality pandas UDF groupby steps (100,000s+ of unique
groups), jobs will either fail or see a high number of task failures due to
network errors on larger clusters (100+ nodes). This was not the specific code
causing issues, but it should be close to representative:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.functions import rand
from fancyimpute import IterativeSVD
import numpy as np
import pandas as pd

df = spark.range(0, 100000).withColumn('v', rand())

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def solver(pdf):
    # Per-group imputation; the result is intentionally discarded here --
    # the call just stands in for the real per-group workload.
    pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy()))
    return pdf

df.groupby('id').apply(solver).count()
Calling df.repartition('id') before the groupby is required to fix it. Can we
make this happen automatically without any adverse impacts?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)