[jira] [Commented] (SPARK-20563) going to DataFrame to RDD and back changes the schema, if the schema is not explicitly provided

Bryan Cutler (JIRA) Thu, 04 May 2017 11:45:46 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-20563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997205#comment-15997205
 ]


Bryan Cutler commented on SPARK-20563:
--------------------------------------

I think this is expected.  An RDD carries no schema, so converting a 
DataFrame to an RDD discards it.  When the RDD is converted back to a 
DataFrame without an explicit schema, the schema has to be inferred from the 
data.  Python ints are not limited to 32 bits, so inference maps them to 
LongType rather than IntegerType.

> going to DataFrame to RDD and back changes the schema, if the schema is not 
> explicitly provided
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20563
>                 URL: https://issues.apache.org/jira/browse/SPARK-20563
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Danil Kirsanov
>            Priority: Minor
>
> df.rdd.toDF() converts the IntegerType columns of a DataFrame to LongType if 
> the schema is not explicitly provided in toDF().
> Full reproduction code:
> -------------------------------------
> from pyspark.sql.types import IntegerType, StructType, StructField
> # `spark` is the SparkSession provided by the pyspark shell
> schema = StructType([StructField("a", IntegerType(), True),
>                      StructField("b", IntegerType(), True)])
> df_test = spark.createDataFrame([(1, 2)], schema)
> df_test.printSchema()             # a, b: integer
> df_test.rdd.toDF().printSchema()  # a, b: long



