[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216695#comment-15216695 ]
Jason C Lee commented on SPARK-13802:
-------------------------------------

I tried making a fix where I treat a kwarg row as a dict and reorder the row based on the schema's field names. But my fix failed one of the test cases under python/pyspark/sql/tests.py:

{noformat}
def test_toDF_with_schema_string(self):
    data = [Row(key=i, value=str(i)) for i in range(100)]
    rdd = self.sc.parallelize(data, 5)

    df = rdd.toDF("key: int, value: string")
    self.assertEqual(df.schema.simpleString(), "struct<key:int,value:string>")
    self.assertEqual(df.collect(), data)

    # different but compatible field types can be used.
    df = rdd.toDF("key: string, value: string")
    self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
    self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

    # field names can differ.
    df = rdd.toDF(" a: int, b: string ")
    self.assertEqual(df.schema.simpleString(), "struct<a:int,b:string>")
    self.assertEqual(df.collect(), data)
{noformat}

This shows that the schema names don't have to correspond to the row's field names. Rows are ordered by the row's field names, not by the schema names. By providing schema names, you are essentially 'renaming' the columns. So maybe a better approach for you is to leave out the schema. That way the schema is simply inferred from the first row:

{noformat}
row = Row(id="39", first_name="Szymon")
df = sqlContext.createDataFrame([row])
df.show(1)
+----------+---+
|first_name| id|
+----------+---+
|    Szymon| 39|
+----------+---+
{noformat}

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are
> sorted by name.
> When Schema is reading the row, it is not using the fields in
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}
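
The ordering mismatch discussed in this thread can be sketched in plain Python, without PySpark. This is only a simplified model under stated assumptions: the helper names (make_kwargs_row, apply_schema_by_position, reorder_to_schema) are hypothetical and are not PySpark APIs. The model assumes, as the issue describes for 1.6.0, that a kwargs row stores its values sorted by field name and that an explicit schema is applied by position. The last helper shows one plausible reason a name-based reorder conflicts with the "field names can differ" test case; it is not the actual attempted fix.

```python
# Simplified, hypothetical model of the behavior described in SPARK-13802.
# Nothing here imports or calls PySpark; all helper names are illustrative.

def make_kwargs_row(**kwargs):
    """Mimic a kwargs row as described in the issue: fields sorted by name."""
    names = sorted(kwargs)
    values = tuple(kwargs[name] for name in names)
    return names, values


def apply_schema_by_position(schema_names, row_values):
    """Pair schema names with row values by position, ignoring row names."""
    return dict(zip(schema_names, row_values))


def reorder_to_schema(schema_names, row_names, row_values):
    """Name-based reorder: look up each schema name among the row's names."""
    by_name = dict(zip(row_names, row_values))
    return tuple(by_name[name] for name in schema_names)


row_names, row_values = make_kwargs_row(id="39", first_name="Szymon")

# Positional application attaches the schema's labels to the sorted values,
# which reproduces the swapped columns from the report.
mislabeled = apply_schema_by_position(["id", "first_name"], row_values)

# Reordering by name restores the intended order when the names match.
fixed = reorder_to_schema(["id", "first_name"], row_names, row_values)

# With renamed schema fields (as in the "a: int, b: string" test case),
# a name-based lookup has nothing to match against.
try:
    reorder_to_schema(["a", "b"], row_names, row_values)
    rename_supported = True
except KeyError:
    rename_supported = False

print(row_names, row_values)
print(mislabeled)
print(fixed)
print(rename_supported)
```

In this model the positional approach supports renaming but mislabels a kwargs row, while the name-based approach fixes the labels but cannot rename, which is the trade-off the comment above points out.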