[
https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuval Tanny updated SPARK-11868:
--------------------------------
Affects Version/s: (was: 1.5.1)
1.5.2
> wrong results returned from DataFrame created from Rows without a consistent
> schema in PySpark
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-11868
> URL: https://issues.apache.org/jira/browse/SPARK-11868
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.5.2
> Environment: pyspark
> Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it is
> possible to create a DataFrame from dictionaries, and if a key is missing its
> value is set to None. But when trying to create a DataFrame from the
> corresponding Rows, we get inconsistent behavior (wrong values assigned to
> keys) without any exception. See the example below.
> The problems seem to be:
> 1. The schema is not verified against all rows.
> 2. In pyspark.sql.types._create_converter, None is set when converting a
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is simply assumed that the number of fields in the tuple
> equals the number of fields in the inferred schema, and otherwise values end up
> under the wrong keys (illustrated in the sketch below):
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
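> To illustrate, here is a minimal standalone sketch (not the actual
> pyspark.sql.types code; identity converters are used for brevity) that mimics
> the two conversion paths quoted above:
> {code}
> # Stand-ins for the inferred schema's field names and per-field converters.
> names = ['1', '2', '3']
> converters = [lambda v: v] * 3
>
> d = {'1': 1, '3': 3}  # dict missing key '2'
> # dict path: lookup by name, so the missing key simply becomes None
> print(tuple(conv(d.get(name)) for name, conv in zip(names, converters)))
> # (1, None, 3)
>
> row = (1, 3)  # Row/tuple with only fields '1' and '3'
> # Row path: positional zip, so 3 is silently paired with field '2'
> print(tuple(conv(v) for v, conv in zip(row, converters)))
> # (1, 3) -- the value for '3' lands under field '2', and no error is raised
> {code}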
> Thanks.
> Example:
> {code}
> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> Output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}
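> A possible workaround until this is fixed (a sketch only, reusing dicts, sc and
> sqlContext from the example above, and assuming all expected field names are
> known up front) is to normalize the dictionaries to a fixed key set before
> constructing the Rows, so every Row carries the same fields:
> {code}
> field_names = ['1', '2', '3']
> # Fill missing keys with None so all Rows share the same fields.
> normalized = [{name: d.get(name) for name in field_names} for d in dicts]
> safe_rows = [pyspark.sql.Row(**d) for d in normalized]
> safe_df = sqlContext.createDataFrame(sc.parallelize(safe_rows))
> print(safe_df.select(['2']).collect()[10])
> # Row(2=None), matching the dict-based DataFrame
> {code}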
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]