[
https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng updated SPARK-7902:
---------------------------------
Description:
SQL UDF execution in PySpark does not convert Python SQL internal types to
Python types. This causes problems when the input arguments contain UDTs or the
return type is a UDT: the raw SQL values are passed into the Python UDF as-is,
and the return value is not converted back to Python SQL types.
The following code (from [~rams]) reproduces this bug. (Actually, it currently
triggers another bug first.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}
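To illustrate the missing conversion step without a Spark cluster, here is a minimal pure-Python sketch. Everything in it is hypothetical and for illustration only: ToySparseVector, ToyVectorUDT and wrap_udf are not PySpark APIs, and the tuple layout is an assumed internal representation. The only idea it borrows is that a UDT offers a serialize/deserialize pair, and that a UDF wrapper would need to call deserialize on UDT inputs and serialize on a UDT return value.

```python
class ToySparseVector:
    """Hypothetical stand-in for a user-facing vector class."""

    def __init__(self, size, entries):
        self.size = size
        self.entries = dict(entries)


class ToyVectorUDT:
    """Hypothetical UDT: maps the user class to a plain tuple and back."""

    def serialize(self, vec):
        # Assumed internal form: (size, sorted indices, matching values).
        indices = sorted(vec.entries)
        return (vec.size, indices, [vec.entries[i] for i in indices])

    def deserialize(self, datum):
        size, indices, values = datum
        return ToySparseVector(size, zip(indices, values))


def wrap_udf(func, input_types, return_type=None):
    """Wrap func so UDT arguments and UDT results are converted.

    input_types holds one entry per argument: a UDT instance, or None
    for arguments that need no conversion.
    """

    def wrapped(*raw_args):
        args = [
            t.deserialize(a) if t is not None else a
            for t, a in zip(input_types, raw_args)
        ]
        result = func(*args)
        if return_type is not None:
            return return_type.serialize(result)
        return result

    return wrapped


udt = ToyVectorUDT()
raw = udt.serialize(ToySparseVector(2, {0: 0.0}))  # what the UDF receives today

sz = lambda s: s.size

# Without conversion the UDF sees a plain tuple, which has no .size attribute.
try:
    sz(raw)
    raw_call_failed = False
except AttributeError:
    raw_call_failed = True

# With conversion the UDF sees the user-facing object.
converted_size = wrap_udf(sz, [udt])(raw)

# A UDF returning a UDT value gets its result serialized again.
round_trip = wrap_udf(lambda v: v, [udt], udt)(raw)
```

In this sketch the unwrapped call fails on the raw tuple, while the wrapped call reads the size from the deserialized object. How the actual fix in PySpark is structured is not stated in this issue.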
was:
We don't convert Python SQL internal types to Python types in SQL UDF
execution. This causes problems if the input arguments contain UDTs or the
return type is a UDT. Right now, the raw SQL types are passed into the Python
UDF and the return value is not converted to Python SQL types.
This is the code to produce this bug. (Actually, it triggers another bug first
right now.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}
> SQL UDF doesn't support UDT in PySpark
> --------------------------------------
>
> Key: SPARK-7902
> URL: https://issues.apache.org/jira/browse/SPARK-7902
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng