Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19646#discussion_r149196424
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -512,9 +557,7 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    -            if schema is None:
    -                schema = [str(x) for x in data.columns]
    -            data = [r.tolist() for r in data.to_records(index=False)]
    --- End diff --
    
    The problem is that nanosecond values cannot be converted to a Python datetime object, which only has microsecond resolution, so NumPy converts them to longs. NumPy will convert timestamps at microsecond resolution and coarser to Python datetime objects, which Spark will correctly infer.
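
    For example (a standalone illustration of that behavior, not part of this diff; only NumPy is assumed):

    import numpy as np

    # datetime64[ns] values can't be represented by datetime.datetime (which
    # only has microsecond resolution), so tolist() falls back to plain
    # integers (nanoseconds since the epoch).
    ns_arr = np.array(['2017-11-06T12:34:56.789012345'], dtype='datetime64[ns]')
    print(ns_arr.tolist())   # [1509971696789012345] -- a long, not a datetime

    # At microsecond resolution and coarser, tolist() returns datetime
    # objects, which Spark's schema inference handles correctly.
    us_arr = ns_arr.astype('datetime64[us]')
    print(us_arr.tolist())   # [datetime.datetime(2017, 11, 6, 12, 34, 56, 789012)]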
    
    > according to the ticket, seems we need to convert numpy.datetime64 to python datetime manually.
    
    This fix is just meant to convert nanosecond timestamps to microseconds so that calling `tolist()` can fit them in a Python object. Does it seem ok to you guys to leave it at that scope for now?
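
    For concreteness, here is a rough sketch of that scope (hypothetical names, not the PR's actual code): narrow any `datetime64[ns]` fields in the numpy record dtype to `datetime64[us]` before calling `tolist()`:

    import numpy as np
    import pandas as pd

    def _records_with_us_timestamps(pdf):
        # Hypothetical helper: dump the pandas DataFrame to numpy records, but
        # downcast nanosecond timestamp fields to microseconds first so that
        # tolist() yields datetime.datetime objects Spark can infer.
        records = pdf.to_records(index=False)
        curr = records.dtype
        fixed = np.dtype([(name,
                           'datetime64[us]' if curr[name] == np.dtype('datetime64[ns]')
                           else curr[name])
                          for name in curr.names])
        return [r.tolist() for r in records.astype(fixed)]

    pdf = pd.DataFrame({'ts': pd.to_datetime(['2017-11-06 12:34:56.789012345'])})
    print(_records_with_us_timestamps(pdf))
    # roughly [(datetime.datetime(2017, 11, 6, 12, 34, 56, 789012),)]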

