Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19646#discussion_r148709362
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema):
             data = [schema.toInternal(row) for row in data]
             return self._sc.parallelize(data), schema
     
    +    def _getNumpyRecordDtypes(self, rec):
    +        """
    +        Used when converting a pandas.DataFrame to Spark using to_records(), this will correct
    +        the dtypes of records so they can be properly loaded into Spark.
    +        :param rec: a numpy record to check dtypes
    +        :return corrected dtypes for a numpy.record or None if no correction needed
    +        """
    +        import numpy as np
    +        cur_dtypes = rec.dtype
    +        col_names = cur_dtypes.names
    +        record_type_list = []
    +        has_rec_fix = False
    +        for i in xrange(len(cur_dtypes)):
    +            curr_type = cur_dtypes[i]
    +            # If type is a datetime64 timestamp, convert to microseconds
    +            # NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs,
    +            # this conversion will lead to an output of py datetime objects, see SPARK-22417
    +            if curr_type == np.dtype('M8[ns]'):
    --- End diff --
    
    Yes, I'd prefer it if that works; otherwise I'd like you to add a comment saying we can use `M8[ns]` instead of `datetime64[ns]`.
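
    As context (an illustrative sketch, not code from the PR): `'M8[ns]'` is just the shorthand
    type code for `'datetime64[ns]'`, so the two spellings name the same numpy dtype. The snippet
    below also shows the SPARK-22417 behavior the NOTE in the diff refers to -- with nanosecond
    precision, `tolist()` on the record array returns plain longs, while casting the field down to
    microseconds yields Python datetime objects that Spark can load as timestamps. The example
    DataFrame and column name here are made up for illustration.

        import numpy as np
        import pandas as pd

        # 'M8[ns]' is the shorthand type code for 'datetime64[ns]' -- the same dtype object.
        assert np.dtype('M8[ns]') == np.dtype('datetime64[ns]')

        # Hypothetical single-column DataFrame with a nanosecond-precision timestamp.
        pdf = pd.DataFrame({'ts': pd.to_datetime(['2017-11-01 12:00:00'])})
        rec = pdf.to_records(index=False)            # record dtype is [('ts', '<M8[ns]')]

        # With 'M8[ns]', tolist() gives plain longs (nanoseconds since the epoch).
        print(rec.tolist())                          # [(1509537600000000000,)]

        # Casting the field to microsecond precision makes tolist() return
        # datetime.datetime objects instead.
        fixed = rec.astype([('ts', 'datetime64[us]')])
        print(fixed.tolist())                        # [(datetime.datetime(2017, 11, 1, 12, 0),)]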

