Frederik created SPARK-25801:
--------------------------------

             Summary: pandas_udf grouped_map fails with input dataframe with 
more than 255 columns
                 Key: SPARK-25801
                 URL: https://issues.apache.org/jira/browse/SPARK-25801
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
         Environment: python 2.7

pyspark 2.3.0
            Reporter: Frederik


Hi,

I'm using a pandas_udf to deploy a model to predict all samples in a spark 
dataframe,

for this I use a udf as follows:
@pandas_udf("scores double", PandasUDFType.GROUPED_MAP) def 
predict_scores(pdf):  score_values = model.predict_proba(pdf)[:,1]  return 
pd.DataFrame({'scores': score_values})
So it takes a dataframe and predicts the probability of being positive 
according to an sklearn model for each row and returns this as single column. 
This works great on a random groupBy, e.g.:
sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
as long as the dataframe has <255 columns. When the input dataframe has more 
than 255 columns (thus features in my model), I get:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, 
in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
  File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, 
in read_udfs
    mapper = eval(mapper_str, udfs)
  File "<string>", line 1
SyntaxError: more than 255 arguments
Which seems to be related with Python's general limitation of having not 
allowing more than 255 arguments for a function?

 

Is this a bug or is there a straightforward way around this problem?

 

Regards,

Frederik



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to