[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler resolved SPARK-25801.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> pandas_udf grouped_map fails with input dataframe with more than 255 columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25801
>                 URL: https://issues.apache.org/jira/browse/SPARK-25801
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: python 2.7
>                      pyspark 2.3.0
>            Reporter: Frederik
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Hi,
> I'm using a pandas_udf to deploy a model to predict all samples in a Spark
> dataframe. For this I use a udf as follows:
>
> @pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
> def predict_scores(pdf):
>     score_values = model.predict_proba(pdf)[:,1]
>     return pd.DataFrame({'scores': score_values})
>
> So it takes a dataframe, predicts the probability of being positive
> according to an sklearn model for each row, and returns this as a single column.
> This works great on a random groupBy, e.g.:
>
> sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
>
> as long as the dataframe has <255 columns. When the input dataframe has more
> than 255 columns (thus features in my model), I get:
>
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
>     mapper = eval(mapper_str, udfs)
>   File "<string>", line 1
> SyntaxError: more than 255 arguments
>
> This seems to be related to Python's general limitation of not allowing
> more than 255 arguments for a function?
>
> Is this a bug or is there a straightforward way around this problem?
>
> Regards,
> Frederik

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
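
For readers following the traceback above: the error comes from `eval(mapper_str, udfs)`, so the generated source string itself appears to contain a call with one argument per input column. The sketch below is only an illustration of that failure mode and of one possible way around it (passing all columns as a single list argument). It is not the actual change made for SPARK-25801. The helper names `build_positional_mapper`, `build_sequence_mapper`, and `compiles`, as well as the stand-in function `f`, are hypothetical. Whether the positional form compiles depends on the interpreter: the reporter's Python 2.7 rejects it, while, as far as I know, Python 3.7 and later no longer enforce the 255-argument limit.

```python
def build_positional_mapper(n_args):
    """Source for a lambda that passes every column as its own argument."""
    args = ", ".join("a[%d]" % i for i in range(n_args))
    return "lambda a: f(%s)" % args


def build_sequence_mapper(n_args):
    """Source for a lambda that passes all columns as one list argument."""
    return "lambda a: f([a[i] for i in range(%d)])" % n_args


def compiles(source):
    """Return True if the expression compiles on the running interpreter."""
    try:
        compile(source, "<string>", "eval")
        return True
    except SyntaxError:
        return False


n_columns = 300
columns = list(range(n_columns))  # stand-in for 300 input columns

# Interpreter-dependent: interpreters that enforce the limit (such as the
# reporter's Python 2.7) reject this source with a SyntaxError.
positional_ok = compiles(build_positional_mapper(n_columns))

# The generated call has a single argument, so the limit never applies.
# `sum` stands in for the user function f (an assumption for illustration).
sequence_mapper = eval(build_sequence_mapper(n_columns), {"f": sum})
total = sequence_mapper(columns)

print("positional form compiles here:", positional_ok)
print("sequence form result:", total)
```

The design point is that the number of columns moves out of the generated source's argument list and into ordinary data, so the width of the dataframe no longer affects whether the mapper string can be compiled.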