[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

Jason C Lee (JIRA) Tue, 29 Mar 2016 12:24:13 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216695#comment-15216695
 ]


Jason C Lee commented on SPARK-13802:
-------------------------------------

I tried making a fix where I treat a kwarg row as a dict, and reorder the row 
based on the schema name. But my fix failed one of the test cases under 
python/pyspark/sql/tests.py
{noformat}
   def test_toDF_with_schema_string(self):
        data = [Row(key=i, value=str(i)) for i in range(100)]
        rdd = self.sc.parallelize(data, 5)

        df = rdd.toDF("key: int, value: string")
        self.assertEqual(df.schema.simpleString(), 
"struct<key:int,value:string>")
        self.assertEqual(df.collect(), data)

        # different but compatible field types can be used.
        df = rdd.toDF("key: string, value: string")
        self.assertEqual(df.schema.simpleString(), 
"struct<key:string,value:string>")
        self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in 
range(100)])

        # field names can differ.
        df = rdd.toDF(" a: int, b: string ")
        self.assertEqual(df.schema.simpleString(), "struct<a:int,b:string>")
        self.assertEqual(df.collect(), data)
{noformat}

This shows that the schema names don't have to correspond to the row's names. 
Rows are ordered based on row's names, not schema names. By providing schema 
names, you are essentially 'renaming' the column names. 

So, maybe a better approach for you is to leave out the schema. This way the 
schema can just be inferred from the first row:
{noformat}
row = Row(id="39", first_name="Szymon")
df = sqlContext.createDataFrame([row])
df.show(1)

+----------+---+
|first_name| id|
+----------+---+
|    Szymon| 39|
+----------+---+

{noformat}



> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

Reply via email to