GitHub user gberger commented on the issue:
https://github.com/apache/spark/pull/19792
@ueshin
The reason I modified the StructType case is that, in
session.py#341, for each Pandas DF row we obtain a StructType whose StructFields
map column names to value types; these per-row schemas are then reduced with `_merge_type`.
I do appreciate that a Pandas DF could also contain lists
or dicts as values. I pushed a new commit where the `name` property is passed
down when we recurse via ArrayType or MapType.
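Roughly, the idea is the following simplified sketch (not the exact code in the commit; merging of extra fields from `b` into the StructType is omitted here):
```
# Simplified sketch of the idea, not the exact diff: thread the column/field
# name through the recursive merge so that a type conflict inside a nested
# ArrayType or MapType is still reported against the column that produced it.
from pyspark.sql.types import ArrayType, MapType, NullType, StructField, StructType


def _merge_type(a, b, name=None):
    if isinstance(a, NullType):
        return b
    elif isinstance(b, NullType):
        return a
    elif type(a) is not type(b):
        raise TypeError("Can not merge type %s and %s in field '%s'"
                        % (type(a), type(b), name))

    if isinstance(a, StructType):
        # Merge field by field, passing each field's own name down.
        nfs = dict((f.name, f.dataType) for f in b.fields)
        return StructType([
            StructField(f.name,
                        _merge_type(f.dataType, nfs.get(f.name, NullType()),
                                    name=f.name))
            for f in a.fields])
    elif isinstance(a, ArrayType):
        # An array column's element type inherits the enclosing column's name.
        return ArrayType(_merge_type(a.elementType, b.elementType, name=name),
                         True)
    elif isinstance(a, MapType):
        # Likewise for the key and value types of a map column.
        return MapType(_merge_type(a.keyType, b.keyType, name=name),
                       _merge_type(a.valueType, b.valueType, name=name),
                       True)
    else:
        return a
```
With the name threaded down like this, a mismatch inside an array or map value is reported against the top-level column, as in the dict example further below.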
Here is what it looks like when we use lists or dicts inside a Pandas DF:
```
>>> df = pd.DataFrame(data={
... 'a': [[1, 2], [3, 4]],
... 'b': [[5, 'hello'], [7, 8]]
... })
>>> sdf = sql.createDataFrame(df)
>>> sdf
DataFrame[a: array<bigint>, b: array<bigint>]
>>> sdf.show()
+------+---------+
| a| b|
+------+---------+
|[1, 2]|[5, null]|
|[3, 4]| [7, 8]|
+------+---------+
```
```
>>> df = pd.DataFrame(data={
... 'a': [{1: 2}, {3: 4}],
... 'b': [{5: 'hello'}, {7: 8}]
... })
>>> sdf = sql.createDataFrame(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gberger/Projects/spark/python/pyspark/sql/context.py", line 354, in createDataFrame
    return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 646, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 409, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/session.py", line 341, in _inferSchemaFromList
    schema = reduce(_merge_type, [_infer_schema(row, names) for row in data])
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1128, in _merge_type
    for f in a.fields]
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1128, in <listcomp>
    for f in a.fields]
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1140, in _merge_type
    _merge_type(a.valueType, b.valueType, name=name),
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1122, in _merge_type
    raise TypeError("Can not merge type %s and %s in field '%s'" % (type(a), type(b), name))
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.LongType'> in field 'b'
```