This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 9fdf65aefc5 [SPARK-44854][PYTHON] Python timedelta to DayTimeIntervalType edge case bug
9fdf65aefc5 is described below

commit 9fdf65aefc552c909f6643f8a31405d0622eeb7e
Author: Ocean <haghighid...@gmail.com>
AuthorDate: Tue Aug 22 11:54:40 2023 +0900

    [SPARK-44854][PYTHON] Python timedelta to DayTimeIntervalType edge case bug
    
    ### What changes were proposed in this pull request?
    
    This PR changes how Python `datetime.timedelta` objects are converted
    to `pyspark.sql.types.DayTimeIntervalType` values. Specifically, it
    modifies the logic inside `toInternal`, which returns the timedelta as
    a Python integer (an int64 in other languages) holding the total
    number of microseconds. The current logic inadvertently adds an extra
    second when converting certain timedelta objects, thereby returning an
    incorrect value.
    
    An illustrative example is as follows:
    
    ```
    from datetime import timedelta
    from pyspark.sql.types import DayTimeIntervalType, StructField, StructType
    
    spark = ...spark session setup here...
    
    td = timedelta(days=4498031, seconds=16054, microseconds=999981)
    df = spark.createDataFrame(
        [(td,)],
        StructType([StructField(name="timedelta_col", dataType=DayTimeIntervalType(), nullable=False)]),
    )
    df.show(truncate=False)
    
    > +------------------------------------------------+
    > |timedelta_col                                   |
    > +------------------------------------------------+
    > |INTERVAL '4498031 04:27:35.999981' DAY TO SECOND|
    > +------------------------------------------------+
    
    print(str(td))
    
    > '4498031 days, 4:27:34.999981'
    ```
    
    In the above example, look at the seconds: the original Python
    timedelta object has 34 seconds, while the PySpark DayTimeIntervalType
    column shows 35 seconds.
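
    The root cause (a minimal sketch below, not taken from this commit) is
    that `timedelta.total_seconds()` returns a float: for this example the
    exact value 388629894454.999981 is not representable as a float64 and
    rounds up to exactly 388629894455.0, so `math.floor` no longer strips
    the fractional seconds, and the microseconds are then added on top of
    the already-rounded-up second:

    ```
    import math
    from datetime import timedelta

    td = timedelta(days=4498031, seconds=16054, microseconds=999981)

    # Old logic: total_seconds() rounds up to 388629894455.0, so floor()
    # keeps the extra second and the microseconds are re-added on top.
    old = (math.floor(td.total_seconds()) * 1000000) + td.microseconds

    # Fixed logic: exact integer arithmetic on the timedelta components.
    new = (((td.days * 86400) + td.seconds) * 1_000_000) + td.microseconds

    print(old)  # 388629894455999981 (one second too many)
    print(new)  # 388629894454999981 (matches td exactly)
    ```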
    
    ### Why are the changes needed?
    
    To fix a bug: the wrong value is returned after conversion. Adding the
    above timedelta to the existing unit test causes the test to fail.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Users will now see the correct timedelta values in PySpark
    DataFrames for such edge cases.
    
    ### How was this patch tested?
    
    Illustrative edge cases were added to the `test_daytime_interval_type`
    unit test in `python/pyspark/sql/tests/test_types.py`. The existing
    code was verified to fail the updated test, the new code was added,
    and the test was verified to pass.
    
    ### JIRA ticket link
    This PR should close https://issues.apache.org/jira/browse/SPARK-44854
    
    Closes #42541 from hdaly0/SPARK-44854.
    
    Authored-by: Ocean <haghighid...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/sql/tests/test_types.py | 2 ++
 python/pyspark/sql/types.py            | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py
index 7cb13693a0d..90ecfd65776 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -1204,6 +1204,8 @@ class TypesTestsMixin:
             ),
             (datetime.timedelta(microseconds=-123),),
             (datetime.timedelta(days=-1),),
+            (datetime.timedelta(microseconds=388629894454999981),),
+            (datetime.timedelta(days=-1, seconds=86399, microseconds=999999),),  # -1 microsecond
         ]
         df = self.spark.createDataFrame(timedetlas, schema="td interval day to second")
         self.assertEqual(set(r.td for r in df.collect()), set(set(r[0] for r in timedetlas)))
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 092fa43b1d2..24964c56e2e 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -442,7 +442,7 @@ class DayTimeIntervalType(AnsiIntervalType):
 
     def toInternal(self, dt: datetime.timedelta) -> Optional[int]:
         if dt is not None:
-            return (math.floor(dt.total_seconds()) * 1000000) + dt.microseconds
+            return (((dt.days * 86400) + dt.seconds) * 1_000_000) + dt.microseconds
 
     def fromInternal(self, micros: int) -> Optional[datetime.timedelta]:
         if micros is not None:

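For reference, here is a quick standalone check (a sketch, not part of this
commit) that the integer-based conversion round-trips the edge cases added
to the test above; the `from_internal` helper here assumes, mirroring
`fromInternal` in `pyspark.sql.types`, that a `timedelta` is built directly
from the microsecond count:

```
import datetime

def to_internal(dt: datetime.timedelta) -> int:
    # Mirrors the patched toInternal: exact integer arithmetic only.
    return (((dt.days * 86400) + dt.seconds) * 1_000_000) + dt.microseconds

def from_internal(micros: int) -> datetime.timedelta:
    # timedelta normalizes an arbitrary microsecond count on its own.
    return datetime.timedelta(microseconds=micros)

for td in [
    datetime.timedelta(microseconds=388629894454999981),
    datetime.timedelta(days=-1, seconds=86399, microseconds=999999),  # -1 microsecond
    datetime.timedelta(days=-1),
]:
    assert from_internal(to_internal(td)) == td, td
print("round-trip OK")
```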

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
