[ 
https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7902:
---------------------------------
    Description: 
We don't convert Python SQL internal types to Python types in SQL UDF 
execution. This causes problems if the input arguments contain UDTs or the 
return type is a UDT. Right now, the raw SQL types are passed into the Python 
UDF and the return value is not converted to Python SQL types.

The following code (from [~rams]) reproduces this bug. (Right now, it actually 
triggers another bug first.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}

  was:
We don't convert Python SQL internal types to Python types in SQL UDF 
execution. This causes problems if the input arguments contain UDTs or the 
return type is a UDT. Right now, the raw SQL types are passed into the Python 
UDF and the return value is not converted to Python SQL types.

The following code reproduces this bug. (Right now, it actually triggers 
another bug first.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}


> SQL UDF doesn't support UDT in PySpark
> --------------------------------------
>
>                 Key: SPARK-7902
>                 URL: https://issues.apache.org/jira/browse/SPARK-7902
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>
> We don't convert Python SQL internal types to Python types in SQL UDF 
> execution. This causes problems if the input arguments contain UDTs or the 
> return type is a UDT. Right now, the raw SQL types are passed into the Python 
> UDF and the return value is not converted to Python SQL types.
> The following code (from [~rams]) reproduces this bug. (Right now, it actually 
> triggers another bug first.)
> {code}
> from pyspark.mllib.linalg import SparseVector
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
> sz = udf(lambda s: s.size, IntegerType())
> df.select(sz(df.features).alias("sz")).collect()
> {code}
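
The conversion the description says is missing can be sketched without Spark. This is a minimal illustration only, not the PySpark implementation or the eventual fix: SparseVec, SparseVecUDT, and wrap_udf are hypothetical names, and the tuple layout is an assumed internal form. The only thing borrowed from PySpark is the idea that a UDT exposes serialize and deserialize methods.

```python
class SparseVec:
    """Hypothetical stand-in for a user-facing class such as SparseVector."""

    def __init__(self, size, values):
        self.size = size
        self.values = dict(values)


class SparseVecUDT:
    """Hypothetical UDT: maps SparseVec to and from a plain tuple."""

    def serialize(self, obj):
        # User object -> assumed internal form (size, indices, values).
        items = sorted(obj.values.items())
        return (obj.size, [i for i, _ in items], [v for _, v in items])

    def deserialize(self, datum):
        # Assumed internal form -> user object.
        size, indices, values = datum
        return SparseVec(size, zip(indices, values))


def wrap_udf(func, arg_udts, return_udt=None):
    """Wrap func so UDT arguments are deserialized before the call and a
    UDT result is serialized afterwards. A None entry means 'no UDT'."""

    def wrapped(*raw_args):
        args = [
            udt.deserialize(arg) if udt is not None else arg
            for udt, arg in zip(arg_udts, raw_args)
        ]
        result = func(*args)
        return return_udt.serialize(result) if return_udt is not None else result

    return wrapped


udt = SparseVecUDT()
raw = udt.serialize(SparseVec(2, {0: 0.0}))  # what the UDF receives today
sz = wrap_udf(lambda s: s.size, [udt])       # UDF with input conversion
size = sz(raw)
```

In this sketch, calling the unwrapped lambda on raw fails because a tuple has no size attribute, which mirrors the raw SQL value reaching the Python UDF unconverted.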


