Richard Williamson created SPARK-37100:
------------------------------------------

             Summary: Pandas groupby UDFs would benefit from automatically 
repartitioning data on the groupby key in order to prevent network issues 
when running the UDF
                 Key: SPARK-37100
                 URL: https://issues.apache.org/jira/browse/SPARK-37100
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.1.2
            Reporter: Richard Williamson
             Fix For: 3.2.1


When running high-cardinality pandas UDF groupby steps (100,000s+ of unique 
groups), jobs will either fail outright or see a high number of task failures 
due to network errors on larger clusters (100+ nodes). The snippet below is 
not the exact code that caused the issue, but it should be close to 
representative:


from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.functions import rand
from fancyimpute import IterativeSVD
import numpy as np
import pandas as pd

df = spark.range(0, 100000).withColumn('v', rand())

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def solver(pdf):
    # Impute with IterativeSVD, keeping the original column names so the
    # returned frame still matches df.schema.
    pdf = pd.DataFrame(IterativeSVD(verbose=False).fit_transform(pdf.to_numpy()),
                       columns=pdf.columns)
    return pdf

df.groupby('id').apply(solver).count()
 
Adding df.repartition('id') before the groupby is required to fix it. Can we 
make this happen automatically, without any adverse impacts?
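The requested behavior amounts to hash-partitioning the rows on the groupby 
key before applying the UDF, so that all rows for a given group land on one 
executor and the grouped-map step needs no extra shuffle. A minimal 
pure-Python sketch of that mechanism (the hash_partition helper is 
hypothetical, not Spark code):

```python
# Sketch of hash partitioning on a key, the mechanism that
# df.repartition('id') relies on: rows with the same key always
# hash to the same partition.
def hash_partition(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Deterministic key -> partition mapping.
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{'id': i % 5, 'v': float(i)} for i in range(20)]
parts = hash_partition(rows, 'id', 4)
```

After partitioning this way, each distinct id appears in exactly one 
partition, which is why the grouped-map UDF no longer triggers network-heavy 
cross-node fetches.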



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
