GitHub user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19646#discussion_r149211050
--- Diff: python/pyspark/sql/session.py ---
@@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema

+    def _get_numpy_record_dtypes(self, rec):
+        """
+        Used when converting a pandas.DataFrame to Spark using to_records(), this will correct
+        the dtypes of records so they can be properly loaded into Spark.
+        :param rec: a numpy record to check dtypes
+        :return: corrected dtypes for a numpy.record or None if no correction needed
+        """
+        import numpy as np
+        cur_dtypes = rec.dtype
+        col_names = cur_dtypes.names
+        record_type_list = []
+        has_rec_fix = False
+        for i in xrange(len(cur_dtypes)):
+            curr_type = cur_dtypes[i]
+            # If type is a datetime64 timestamp, convert to microseconds
+            # NOTE: if dtype is datetime64[ns] then np.record.tolist() will output values as longs,
+            # conversion from [us] or lower will lead to py datetime objects, see SPARK-22417
+            if curr_type == np.dtype('datetime64[ns]'):
+                curr_type = 'datetime64[us]'
+                has_rec_fix = True
+            record_type_list.append((str(col_names[i]), curr_type))
+        return record_type_list if has_rec_fix else None
+
+    def _convert_from_pandas(self, pdf, schema):
+        """
+        Convert a pandas.DataFrame to a list of records that can be used to make a DataFrame
+        :return: tuple of list of records and schema
+        """
+        # If no schema supplied by user then get the names of columns only
+        if schema is None:
+            schema = [str(x) for x in pdf.columns]
+
+        # Convert pandas.DataFrame to list of numpy records
+        np_records = pdf.to_records(index=False)
+
+        # Check if any columns need to be fixed for Spark to infer properly
+        if len(np_records) > 0:
+            record_type_list = self._get_numpy_record_dtypes(np_records[0])
+            if record_type_list is not None:
+                return [r.astype(record_type_list).tolist() for r in np_records], schema
--- End diff ---
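For context on the NOTE in the patch above, here is a minimal standalone
sketch (not part of the PR; the single-column frame is hypothetical) of how
record.tolist() behaves before and after the cast:

    import numpy as np
    import pandas as pd

    pdf = pd.DataFrame({"ts": [pd.Timestamp("2017-11-06 12:00:00")]})
    rec = pdf.to_records(index=False)[0]

    # With the default datetime64[ns] dtype, tolist() yields a raw long
    # (nanoseconds since epoch), since Python datetimes cannot hold ns.
    print(rec.tolist())

    # After casting the record to datetime64[us], tolist() yields a
    # datetime.datetime, which Spark infers as a TimestampType.
    print(rec.astype([("ts", "datetime64[us]")]).tolist())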
Then that would modify the input pandas.DataFrame from the user, which
would be bad if they use it after this call. Making a copy of the DataFrame
might not be good either if it is large.
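To illustrate the point (a sketch under the same assumptions as above):
to_records() and astype() both return copies, so the per-record cast in this
patch leaves the caller's DataFrame untouched, whereas an in-place dtype
change on pdf itself would not.

    import pandas as pd

    pdf = pd.DataFrame({"ts": [pd.Timestamp("2017-11-06 12:00:00")]})
    np_records = pdf.to_records(index=False)

    # Cast each record copy; nothing is written back into pdf.
    converted = [r.astype([("ts", "datetime64[us]")]).tolist()
                 for r in np_records]

    assert pdf["ts"].dtype == "datetime64[ns]"  # original dtype intact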