spark git commit: [SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment

ueshin Tue, 09 Jan 2018 21:01:13 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-2.3 45f5c3cee -> 20a8c8867



[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment

## What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that a timestamp series is assigned 
back only when it was modified, and only to a copied version of the Pandas 
DataFrame.  Previously, if the Pandas DataFrame was only a reference (e.g. a 
slice of another), each series was still assigned back to the reference even 
if it was not a modified timestamp column.  This caused the following warning: 
"SettingWithCopyWarning: A value is trying to be set on a copy of a slice from 
a DataFrame."
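
The copy-once, assign-only-if-modified pattern described above can be sketched 
outside of Spark as follows.  This is a minimal illustration, not the code in 
session.py: `localize_naive` is a hypothetical stand-in for 
`_check_series_convert_timestamps_tz_local`, assumed here to return the very 
same Series object when no conversion is needed, and `convert_timestamps` is 
likewise an illustrative name.

```python
import pandas as pd


def localize_naive(series, timezone):
    """Hypothetical stand-in for _check_series_convert_timestamps_tz_local.

    Assumption: returns the *same* object when nothing needs converting,
    so callers can detect a modification with an identity check.
    """
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        # Convert tz-aware values to the target zone, then drop the tz info
        return series.dt.tz_convert(timezone).dt.tz_localize(None)
    return series


def convert_timestamps(pdf, timezone):
    """Return pdf unchanged, or a copy with converted timestamp columns."""
    copied = False
    for column, series in pdf.items():
        s = localize_naive(series, timezone)
        if s is not series:
            if not copied:
                # Copy once so the caller's DataFrame is never written to
                pdf = pdf.copy()
                copied = True
            # Assign only the modified series, and only to the copy
            pdf[column] = s
    return pdf


base = pd.DataFrame({
    "n": [1, 2],
    "ts": pd.to_datetime(["2018-01-10 05:00", "2018-01-10 06:00"]).tz_localize("UTC"),
})
out = convert_timestamps(base, "Asia/Tokyo")

plain = pd.DataFrame({"n": [1, 2]})
same = convert_timestamps(plain, "Asia/Tokyo")
```

With this structure, a DataFrame that has no columns needing conversion is 
never written to at all, which is what avoids the assignment onto a slice.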

## How was this patch tested?

existing tests

Author: Bryan Cutler <[email protected]>

Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.

(cherry picked from commit 7bcc2666810cefc85dfa0d6679ac7a0de9e23154)
Signed-off-by: Takuya UESHIN <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/20a8c886
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/20a8c886
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/20a8c886

Branch: refs/heads/branch-2.3
Commit: 20a8c88671da73ab5d0e7048fe0424022ec16853
Parents: 45f5c3c
Author: Bryan Cutler <[email protected]>
Authored: Wed Jan 10 14:00:07 2018 +0900
Committer: Takuya UESHIN <[email protected]>
Committed: Wed Jan 10 14:00:19 2018 +0900

----------------------------------------------------------------------
 python/pyspark/sql/session.py | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/20a8c886/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 6052fa9..3e45747 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -459,21 +459,23 @@ class SparkSession(object):
                     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
                     if isinstance(field.dataType, TimestampType):
                         s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
-                        if not copied and s is not pdf[field.name]:
-                            # Copy once if the series is modified to prevent the original Pandas
-                            # DataFrame from being updated
-                            pdf = pdf.copy()
-                            copied = True
-                        pdf[field.name] = s
+                        if s is not pdf[field.name]:
+                            if not copied:
+                                # Copy once if the series is modified to prevent the original
+                                # Pandas DataFrame from being updated
+                                pdf = pdf.copy()
+                                copied = True
+                            pdf[field.name] = s
             else:
                 for column, series in pdf.iteritems():
-                    s = _check_series_convert_timestamps_tz_local(pdf[column], timezone)
-                    if not copied and s is not pdf[column]:
-                        # Copy once if the series is modified to prevent the original Pandas
-                        # DataFrame from being updated
-                        pdf = pdf.copy()
-                        copied = True
-                    pdf[column] = s
+                    s = _check_series_convert_timestamps_tz_local(series, timezone)
+                    if s is not series:
+                        if not copied:
+                            # Copy once if the series is modified to prevent the original
+                            # Pandas DataFrame from being updated
+                            pdf = pdf.copy()
+                            copied = True
+                        pdf[column] = s
 
         # Convert pandas.DataFrame to list of numpy records
         np_records = pdf.to_records(index=False)


