[
https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325371#comment-16325371
]
Hyukjin Kwon commented on SPARK-22980:
--------------------------------------
They are already well explained in the documentation. pandas_udf expects Pandas
Series as input and output. The newly added description adds a few more details
covering the example above and what you said.
Input and output are designed to be Pandas Series for vectorised operations in
pandas_udf (scalar vectorised UDFs), as documented. Therefore, a builtin function
such as len operates on the whole Pandas Series, not on each value, because the
input is a Pandas Series as documented. It produces the expected results.
Neither is an error case, and both produce meaningful results. pandas_udf and udf
are different, so they produce different results, which is documented.
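To make the difference concrete, here is a minimal sketch in plain pandas, with no
Spark session involved. The Series x and y are hypothetical stand-ins for a single
batch of lit('text') and col('id') from the example in the issue. It assumes all
three rows arrive in one batch, which Spark does not guarantee.

```python
import pandas as pd

# Hypothetical stand-ins for one batch of the two columns (an assumption:
# real batch boundaries depend on partitioning and Arrow batch size).
x = pd.Series(["text", "text", "text"])  # what lit('text') would look like
y = pd.Series([0, 1, 2])                 # what col('id') would look like

# The pandas_udf lambda receives whole Series, so len(x) is the number
# of rows in the batch, not the length of the string 'text'.
batch_result = len(x) + y

# The row-at-a-time udf lambda receives one value per call, so len(x)
# is the length of each string.
row_result = pd.Series([len(a) + b for a, b in zip(x, y)])

# A vectorised way to get the per-value string length inside a pandas_udf
# is the .str accessor.
vectorised_result = x.str.len() + y

print(batch_result.tolist())
print(row_result.tolist())
print(vectorised_result.tolist())
```

If per-value semantics are wanted inside pandas_udf, something like x.str.len() + y
is one option; the original lambda is still valid, it just answers a different
question.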
> Using pandas_udf when inputs are not Pandas's Series or DataFrame
> -----------------------------------------------------------------
>
> Key: SPARK-22980
> URL: https://issues.apache.org/jira/browse/SPARK-22980
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Fix For: 2.3.0
>
>
> {noformat}
> from pyspark.sql.functions import pandas_udf
> from pyspark.sql.functions import col, lit
> from pyspark.sql.types import LongType
> df = spark.range(3)
> f = pandas_udf(lambda x, y: len(x) + y, LongType())
> df.select(f(lit('text'), col('id'))).show()
> {noformat}
> {noformat}
> from pyspark.sql.functions import udf
> from pyspark.sql.functions import col, lit
> from pyspark.sql.types import LongType
> df = spark.range(3)
> f = udf(lambda x, y: len(x) + y, LongType())
> df.select(f(lit('text'), col('id'))).show()
> {noformat}
> The results of pandas_udf are different from udf.