Yuval Tanny created SPARK-11868:
-----------------------------------

             Summary: Wrong results returned from a DataFrame created from Rows 
without a consistent schema in PySpark
                 Key: SPARK-11868
                 URL: https://issues.apache.org/jira/browse/SPARK-11868
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.5.1
         Environment: pyspark
            Reporter: Yuval Tanny


When the schema is inconsistent across rows (but is the same for the first 10 
rows), it is possible to create a DataFrame from dictionaries, and if a key is 
missing its value is None. But when trying to create a DataFrame from the 
corresponding Rows, we get inconsistent behavior (wrong values under wrong 
keys) without any exception. See the example below.

The problems seem to be:
1. Not all rows are verified against the inferred schema.
2. In pyspark.sql.types._create_converter, None is set when converting a 
dictionary in which a field does not exist:
{code}
return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
{code}
But for Rows, it is simply assumed that the number of fields in the tuple 
equals the number of fields in the inferred schema, and otherwise values are 
placed under the wrong keys:
{code}
return tuple(conv(v) for v, conv in zip(obj, converters))
{code}
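The misalignment can be demonstrated in plain Python, without Spark (a minimal sketch; `names` and the pass-through `identity` converters are stand-ins for the inferred schema and the real converters in pyspark.sql.types):

```python
# Sketch of the two conversion paths, with pass-through converters
# standing in for the real ones in pyspark.sql.types.
def identity(v):
    return v

names = ['1', '2', '3']            # fields inferred from the first rows
converters = [identity] * 3

d = {'1': 1, '3': 3}               # dictionary missing key '2'
short_row = (1, 3)                 # the same record as a Row tuple

# Dict path: a missing key is looked up by name and becomes None.
dict_result = tuple(conv(d.get(name)) for name, conv in zip(names, converters))
print(dict_result)  # (1, None, 3)

# Row path: values are paired positionally, so the value of field '3'
# silently lands under field '2', and field '3' is dropped entirely.
row_result = tuple(conv(v) for v, conv in zip(short_row, converters))
print(row_result)   # (1, 3)
```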

Thanks. 

example:
{code}
# The first 10 records have keys '1', '2', '3'; the last one is missing key '2'.
dicts = [{'1': 1, '2': 2, '3': 3}] * 10 + [{'1': 1, '3': 3}]
rows = [pyspark.sql.Row(**r) for r in dicts]
# sc and sqlContext are the SparkContext / SQLContext from the pyspark shell.
rows_rdd = sc.parallelize(rows)
dicts_rdd = sc.parallelize(dicts)
rows_df = sqlContext.createDataFrame(rows_rdd)
dicts_df = sqlContext.createDataFrame(dicts_rdd)

print(rows_df.select(['2']).collect()[10])   # wrong: value of key '3'
print(dicts_df.select(['2']).collect()[10])  # correct: None
{code}
output:
{code}
Row(2=3)
Row(2=None)
{code}
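Until all rows are verified against the schema, a possible workaround (a sketch only; it assumes the full key set is known up front) is to normalize every dictionary to the same key set before building Rows, so missing fields become explicit Nones:

```python
# Hypothetical workaround: pad each record to a fixed key set so every
# Row has the same fields; missing keys become None explicitly.
all_keys = ['1', '2', '3']  # assumed to be known in advance

dicts = [{'1': 1, '2': 2, '3': 3}] * 10 + [{'1': 1, '3': 3}]
normalized = [{k: d.get(k) for k in all_keys} for d in dicts]

print(normalized[10])  # {'1': 1, '2': None, '3': 3}
```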



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
