[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bago Amirbekian updated SPARK-22232:
------------------------------------
    Description:
The fields in a Row object created from a dict (i.e. {{Row(**kwargs)}}) should be accessed by field name, not by position, because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

  was:
The fields in a Row object created from a dict (i.e. {{Row(**kwargs)}}) should be accessed by field name, not by position, because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

> Row objects in pyspark using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22232
>                 URL: https://issues.apache.org/jira/browse/SPARK-22232
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
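
For readers without a Spark shell at hand, here is a minimal pure-Python sketch of why the description says such rows must be read by name rather than by position. The helper {{make_sorted_row}} is hypothetical: it imitates only the alphabetical field sorting the report attributes to {{Row.__new__}}, and it does not model PySpark's actual implementation or its shuffle/serialization path.

```python
from collections import namedtuple


def make_sorted_row(**kwargs):
    """Hypothetical stand-in for Row(**kwargs): fields are sorted by name."""
    names = sorted(kwargs)
    row_type = namedtuple("SortedRow", names)
    return row_type(*(kwargs[name] for name in names))


# Field order declared in the schema from the report: a, c, b.
schema_order = ["a", "c", "b"]

# Same keyword arguments as toRow() in the report.
row = make_sorted_row(a="a", c=3.0, b=2)

# Pairing schema fields with row values by position mixes up b and c,
# because the row stores its values in alphabetical order (a, b, c).
by_position = dict(zip(schema_order, row))

# Looking each schema field up by name keeps every value with its field.
by_name = {name: getattr(row, name) for name in schema_order}

print(row)
print(by_position)
print(by_name)
```

In this sketch the positional pairing hands the integer to field {{c}} and the float to field {{b}}, while the lookup by name matches the declared schema. Whether this is the exact mechanism behind the shuffled case in the report is not established here.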