[GitHub] [spark] HyukjinKwon commented on a change in pull request #33877: [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark

GitBox Tue, 07 Sep 2021 17:26:18 -0700


HyukjinKwon commented on a change in pull request #33877:
URL: https://github.com/apache/spark/pull/33877#discussion_r703936859




##########
File path: python/pyspark/pandas/data_type_ops/datetime_ops.py
##########
@@ -58,15 +66,18 @@ def sub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             "The timestamp subtraction returns an integer in seconds, "
             "whereas pandas returns 'timedelta64[ns]'."
         )
-        if isinstance(right, IndexOpsMixin) and isinstance(right.spark.data_type, TimestampType):
+        if isinstance(right, IndexOpsMixin) and isinstance(
+            right.spark.data_type, (TimestampType, TimestampNTZType)
+        ):
             warnings.warn(msg, UserWarning)
             return left.astype("long") - right.astype("long")

Review comment:
       Do you suggest something like `(right - left).astype("int")`? This won't
work because:
   1. Intervals can't be converted to longs. Supporting this natively requires
an interval implementation in PySpark.
   2. `TimestampNTZ` is treated as a Unix timestamp in UTC, but
`TIMESTAMP_LTZ - TIMESTAMP_NTZ` or `TIMESTAMP_NTZ - TIMESTAMP_LTZ` will assume
`TIMESTAMP_NTZ` is in the local session time zone. For example:
       ```scala
       scala> sql("SELECT TIMESTAMP '1970-01-01 00:00:00' - TIMESTAMP_NTZ '1970-01-01 00:00:00'").show(false)
       ```
       should result in something like `INTERVAL '0 09:00:00' DAY TO SECOND` (I
am in KST), but it results in `INTERVAL '0 00:00:00' DAY TO SECOND`.
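
The two readings in point 2 can be illustrated without Spark. The following is only a minimal sketch using the Python standard library, not PySpark code: the fixed `+09:00` offset stands in for a KST session time zone, and all variable names are hypothetical. It shows that the same wall-clock literal yields a 9-hour gap when the NTZ value is read as UTC, and no gap when it is read in the session time zone. The sign of the gap depends on operand order, so only the magnitude is compared with the interval quoted above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +09:00 offset stands in for a KST session time zone.
KST = timezone(timedelta(hours=9))

# The same wall-clock literal is used on both sides of the subtraction.
wall_clock = datetime(1970, 1, 1, 0, 0, 0)

# LTZ reading: the literal is interpreted in the session time zone.
ltz_epoch = int(wall_clock.replace(tzinfo=KST).timestamp())

# NTZ read as a Unix timestamp in UTC (the cast-to-long view).
ntz_utc_epoch = int(wall_clock.replace(tzinfo=timezone.utc).timestamp())

# NTZ read in the session time zone (what the SQL subtraction assumes).
ntz_session_epoch = int(wall_clock.replace(tzinfo=KST).timestamp())

# Difference in seconds under each reading.
diff_ntz_as_utc = ltz_epoch - ntz_utc_epoch
diff_ntz_as_session = ltz_epoch - ntz_session_epoch

print(abs(diff_ntz_as_utc) // 3600, "hour gap when NTZ is read as UTC")
print(diff_ntz_as_session, "second gap when NTZ is read in the session time zone")
```

Casting both sides to long corresponds to the first reading, which is why the two approaches are not interchangeable here.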
   
   

##########
File path: python/pyspark/pandas/data_type_ops/datetime_ops.py
##########
@@ -58,15 +66,18 @@ def sub(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
             "The timestamp subtraction returns an integer in seconds, "
             "whereas pandas returns 'timedelta64[ns]'."
         )
-        if isinstance(right, IndexOpsMixin) and isinstance(right.spark.data_type, TimestampType):
+        if isinstance(right, IndexOpsMixin) and isinstance(
+            right.spark.data_type, (TimestampType, TimestampNTZType)
+        ):
             warnings.warn(msg, UserWarning)
             return left.astype("long") - right.astype("long")

Review comment:
       Yeah. So this one has to be removed once intervals are implemented in
PySpark. At the moment, we cannot remove it or let Spark SQL decide via an
implicit cast. That is not because Spark SQL lacks the implicit cast between NTZ
and LTZ, but because PySpark doesn't have an interval implementation.
   
   Or do you suggest implementing the type coercion between NTZ and LTZ in this
PR, and using something like `(right - left).astype("int")`?




