[ https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-23334:
---------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: SPARK-22216

> Fix pandas_udf with return type StringType() to handle str type properly in
> Python 2.
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-23334
>                 URL: https://issues.apache.org/jira/browse/SPARK-23334
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>            Priority: Blocker
>             Fix For: 2.3.0
>
>
> In Python 2, when pandas_udf tries to return a string value created in the
> udf with {{".."}}, the execution fails. E.g.,
> {code:java}
> from pyspark.sql.functions import pandas_udf, col
> import pandas as pd
> df = spark.range(10)
> str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
> df.select(str_f(col('id'))).show()
> {code}
> raises the following exception:
> {code}
> ...
> java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
>       at scala.Predef$.assert(Predef.scala:170)
>       at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
> ...
> {code}
> It seems that pyarrow ignores the {{type}} parameter of {{pa.Array.from_pandas()}}
> and treats the values as binary type when the requested type is string and the
> values are {{str}} instead of {{unicode}} in Python 2.
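
The report attributes the failure to the returned values being {{str}} rather than {{unicode}} in Python 2. As a rough user-side sketch only (not the fix applied in Spark, and not verified against Spark 2.3.0), byte strings could be decoded to text before they are placed in the returned series. The helper names ensure_text and ensure_text_all are hypothetical and are not part of PySpark, pandas, or pyarrow; UTF-8 is an assumed encoding.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers; not part of PySpark, pandas, or pyarrow.

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def ensure_text(value, encoding="utf-8"):
    """Decode byte strings to text; return any other value unchanged.

    On Python 2, `bytes` is an alias for `str`, so a value built with
    "%s" % i is decoded to `unicode`. UTF-8 is an assumption here.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def ensure_text_all(values, encoding="utf-8"):
    """Apply ensure_text to every element of an iterable."""
    return [ensure_text(v, encoding) for v in values]


# Byte strings are decoded; text and None pass through untouched.
converted = ensure_text_all([b"0", u"1", None])
```

Under these assumptions, the udf body from the report might be written as pd.Series(ensure_text_all("%s" % i for i in x)), so that every element is text before pyarrow infers the column type.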