GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22653

    [SPARK-25659][PYTHON] Test type inference specification for createDataFrame in PySpark

    ## What changes were proposed in this pull request?
    
    This PR proposes to specify the type inference behavior in tests, along with simple end-to-end tests. It looks like this logic is not currently tested cleanly.
    
    For instance, see https://github.com/apache/spark/blob/08c76b5d39127ae207d9d1fff99c2551e6ce2581/python/pyspark/sql/types.py#L894-L905
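    For context, the mapping at those lines is roughly the following (a simplified, name-only sketch written for this description; the real dict maps Python types to Spark `DataType` classes, not strings):

    ```python
    import datetime
    import decimal

    # Simplified sketch of PySpark's internal type-inference mapping
    # (types.py, around the linked lines). Values are shown as names
    # here; the real mapping points at DataType classes.
    _type_mappings = {
        type(None): "NullType",
        bool: "BooleanType",
        int: "LongType",
        float: "DoubleType",
        str: "StringType",
        bytearray: "BinaryType",
        decimal.Decimal: "DecimalType",
        datetime.date: "DateType",
        datetime.datetime: "TimestampType",
        datetime.time: "TimestampType",
    }
    ```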
    
    It looks like we also intended to support datetime.time and None in type inference, but neither works:
    
    ```
    >>> spark.createDataFrame([[datetime.time()]])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal
        data = [schema.toInternal(row) for row in data]
      File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal
        for f, v, c in zip(self.fields, obj, self._needConversion))
      File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr>
        for f, v, c in zip(self.fields, obj, self._needConversion))
      File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal
        return self.dataType.toInternal(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal
        else time.mktime(dt.timetuple()))
    AttributeError: 'datetime.time' object has no attribute 'timetuple'
    >>> spark.createDataFrame([[None]])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal
        struct = self._inferSchemaFromList(data, names=schema)
      File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList
        raise ValueError("Some of types cannot be determined after inferring")
    ValueError: Some of types cannot be determined after inferring
    ```
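    The first failure comes down to a plain Python fact: the timestamp converter calls `time.mktime(dt.timetuple())`, and `timetuple()` exists on `datetime.datetime` and `datetime.date` but not on `datetime.time`. A minimal stdlib check, added here for illustration:

    ```python
    import datetime

    # timetuple() is what the TimestampType converter relies on;
    # datetime.time simply does not provide it.
    print(hasattr(datetime.datetime.now(), "timetuple"))  # True
    print(hasattr(datetime.date.today(), "timetuple"))    # True
    print(hasattr(datetime.time(), "timetuple"))          # False
    ```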
    ## How was this patch tested?
    
    Tested manually; unit tests were added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25659

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22653
    
----
commit 8fddb9dca26ce36be1f7eaf0d356bf78070486f9
Author: hyukjinkwon <gurwls223@...>
Date:   2018-10-06T09:26:04Z

    Test type inference specification for createDataFrame in PySpark

----


---
