[jira] [Resolved] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns

Bryan Cutler (JIRA) Tue, 23 Oct 2018 14:09:37 -0700


     [ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bryan Cutler resolved SPARK-25801.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> pandas_udf grouped_map fails with input dataframe with more than 255 columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25801
>                 URL: https://issues.apache.org/jira/browse/SPARK-25801
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: python 2.7
> pyspark 2.3.0
>            Reporter: Frederik
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Hi,
> I'm using a pandas_udf to deploy a model that scores every row of a Spark
> dataframe. The UDF is defined as follows:
>
>     @pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
>     def predict_scores(pdf):
>         score_values = model.predict_proba(pdf)[:, 1]
>         return pd.DataFrame({'scores': score_values})
>
> It takes a dataframe, predicts the probability of the positive class for
> each row with an sklearn model, and returns the result as a single column.
> This works well with an arbitrary groupBy, e.g.:
>
>     sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
>
> as long as the dataframe has <255 columns. When the input dataframe has more
> than 255 columns (and thus features in my model), I get:
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
>     mapper = eval(mapper_str, udfs)
>   File "<string>", line 1
> SyntaxError: more than 255 arguments
> This seems to be related to Python's general limitation of not allowing
> more than 255 arguments for a function?
>  
> Is this a bug or is there a straightforward way around this problem?
>  
> Regards,
> Frederik
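
The failure mode in the traceback can be illustrated without Spark. The sketch below assumes that the worker builds a lambda source string which passes each column as its own positional argument; the traceback only shows `eval(mapper_str, udfs)`, not the string itself, so this is an assumption. CPython versions before 3.7 reject a call with more than 255 explicit arguments at compile time, whereas a call that receives the columns as one list always has a single argument. The names `score_row`, `build_list_mapper` and `build_positional_source` are hypothetical and are not Spark code, and this is not necessarily how the fix in 2.4.0 was implemented.

```python
# Hypothetical stand-in for the user's function: it receives all column
# values as a single list instead of as separate positional arguments.
def score_row(values):
    return sum(values)


def build_positional_source(num_columns):
    # Source text only (never evaluated here): "lambda a: f(a[0], a[1], ...)".
    # With more than 255 columns, compiling this form raises
    # "SyntaxError: more than 255 arguments" on CPython older than 3.7.
    args = ", ".join("a[%d]" % i for i in range(num_columns))
    return "lambda a: f(%s)" % args


def build_list_mapper(num_columns):
    # Source such as "lambda a: f([a[0], a[1], ...])": the call to f has
    # exactly one argument, no matter how many columns there are.
    args = ", ".join("a[%d]" % i for i in range(num_columns))
    mapper_str = "lambda a: f([%s])" % args
    return eval(mapper_str, {"f": score_row})


row = list(range(300))           # 300 "columns", above the old limit
mapper = build_list_mapper(300)
print(mapper(row))               # sum of 0..299
```

On Python 2.7, as in the reported environment, only the list-based form compiles for 300 columns; on Python 3.7 and later both forms compile, so the positional form is shown as source text only.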




