Amogh Param created SPARK-18709:
-----------------------------------

             Summary: Failure to throw error and automatic null conversion bug 
when creating a Spark DataFrame with incompatible types for fields.
                 Key: SPARK-18709
                 URL: https://issues.apache.org/jira/browse/SPARK-18709
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.3, 1.6.2
            Reporter: Amogh Param
             Fix For: 2.0.2


When converting an RDD with a `float` type field to a Spark DataFrame with an 
`IntegerType` / `LongType` schema field, Spark 1.6.2 and 1.6.3 silently convert 
the field values to nulls instead of throwing an error like `LongType can not 
accept object ___ in type <type 'float'>`. However, this seems to be fixed in 
Spark 2.0.2.


The following example should make the problem clear:
{code}
from pyspark.sql.types import StructField, StructType, LongType, DoubleType

schema = StructType([
        StructField("0", LongType(), True),
        StructField("1", DoubleType(), True),
    ])

data = [[1.0, 1.0], [float("nan"), 2.0]]
spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
spark_df.show()
{code}

Instead of throwing an error like:
{code}
LongType can not accept object 1.0 in type <type 'float'>
{code}

Spark converts all the values in the first column to nulls.

Running `spark_df.show()` gives:
{code}
+----+---+
|   0|  1|
+----+---+
|null|1.0|
|null|1.0|
+----+---+
{code}
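
Until the strict check is available, a mismatch like this can be caught on the Python side before calling `createDataFrame`. The following is only a minimal sketch in plain Python 3: `check_long_fields` is a hypothetical helper, not part of Spark, and its error message merely imitates the wording quoted above.

```python
def check_long_fields(rows, long_indexes):
    """Raise TypeError if a value destined for a LongType field is not an int.

    Hypothetical pre-check, not Spark's own verification logic.
    None is allowed because the schema fields are nullable.
    """
    for row in rows:
        for i in long_indexes:
            value = row[i]
            # Exact type test: floats (including NaN) and bools are rejected.
            if value is not None and type(value) is not int:
                raise TypeError(
                    "LongType can not accept object %r in type %s"
                    % (value, type(value))
                )


data = [[1.0, 1.0], [float("nan"), 2.0]]
try:
    check_long_fields(data, [0])
    mismatch_found = False
except TypeError as exc:
    mismatch_found = True
    message = str(exc)
```

Here column 0 is checked because it maps to the `LongType` field in the schema, and the first row already fails the check.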

In my computation, I run `mapPartitions` on a Spark DataFrame. Each partition 
is converted into a pandas DataFrame, a few computations are done on it, and 
the result is returned as a list of lists, so `mapPartitions` yields an RDD of 
rows across all partitions. That RDD is then converted into a Spark DataFrame 
as in the example above, using `sqlContext.createDataFrame(rdd, schema)`. One 
column of the RDD should become a `LongType` field, but because it has missing 
values, its values are of `float` type. When Spark creates the DataFrame, it 
converts all the values in that column to nulls instead of throwing a type 
mismatch error.
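
One possible workaround on the affected versions is to clean each row before building the DataFrame, turning NaN into `None` and whole-number floats into ints. This is a sketch under assumptions: `coerce_long_fields` is a hypothetical helper name, and it assumes Python 3 and that every non-missing value in the column is a whole number.

```python
import math


def coerce_long_fields(row, long_indexes):
    """Return a copy of row whose long_indexes positions hold int or None.

    Hypothetical helper: NaN becomes None, whole-number floats become ints,
    and any other float raises TypeError instead of being silently changed.
    """
    out = list(row)
    for i in long_indexes:
        value = out[i]
        if isinstance(value, float):
            if math.isnan(value):
                out[i] = None          # missing value -> SQL null
            elif value.is_integer():
                out[i] = int(value)    # 1.0 -> 1
            else:
                raise TypeError("cannot store %r in a LongType field" % value)
    return out


rows = [[1.0, 1.0], [float("nan"), 2.0]]
cleaned = [coerce_long_fields(r, [0]) for r in rows]
print(cleaned)  # [[1, 1.0], [None, 2.0]]
```

On an RDD, the same function could be applied with `rdd.map(lambda r: coerce_long_fields(r, [0]))` before calling `sqlContext.createDataFrame(rdd, schema)`.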



