[ https://issues.apache.org/jira/browse/SPARK-44854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-44854.
----------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
                   3.4.2
       Resolution: Fixed

Issue resolved by pull request 42541
[https://github.com/apache/spark/pull/42541]

> Python timedelta to DayTimeIntervalType edge cases bug
> ------------------------------------------------------
>
>                 Key: SPARK-44854
>                 URL: https://issues.apache.org/jira/browse/SPARK-44854
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ocean HD
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.0, 4.0.0, 3.4.2
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> h1. Python Timedelta to PySpark DayTimeIntervalType bug
> There is a bug that causes certain Python datetime.timedelta objects to be 
> converted to a PySpark DayTimeIntervalType column whose value differs from the 
> one stored in the Python timedelta.
> A simple illustrative example can be reproduced with the code below:
>  
> {code:python}
> from datetime import timedelta
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DayTimeIntervalType, StructField, StructType
>
> spark = SparkSession.builder.getOrCreate()  # any active Spark session works
>
> td = timedelta(days=4498031, seconds=16054, microseconds=999981)
> schema = StructType([
>     StructField(name="timedelta_col", dataType=DayTimeIntervalType(), nullable=False)
> ])
> df = spark.createDataFrame([(td,)], schema)
>
> df.show(truncate=False)
> # +------------------------------------------------+
> # |timedelta_col                                   |
> # +------------------------------------------------+
> # |INTERVAL '4498031 04:27:35.999981' DAY TO SECOND|
> # +------------------------------------------------+
>
> print(str(td))
> # '4498031 days, 4:27:34.999981'
> {code}
> In the above example, look at the seconds: the original Python timedelta has 
> 34 seconds, but the PySpark DayTimeIntervalType column shows 35 seconds.
> h1. Fix
> This issue arises because the current conversion from Python timedelta uses 
> the timedelta method `.total_seconds()` to get the number of whole seconds, and 
> then adds the microsecond component back in afterwards. `.total_seconds()` 
> returns a float, and for some timedeltas (large ones whose microsecond 
> component is close to 1_000_000) the nearest representable float is the *next* 
> whole second, so flooring it still yields one second too many; the microseconds 
> are then added on top of that. The effect is that 1 second gets added 
> incorrectly.
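>
> A minimal pure-Python sketch of the rounding (no Spark needed), using the 
> edge-case timedelta from the example above:
>
> {code:python}
> import math
> from datetime import timedelta
>
> td = timedelta(days=4498031, seconds=16054, microseconds=999981)
>
> # Exact number of whole seconds, from the integer components.
> print(td.days * 86400 + td.seconds)      # 388629894454
>
> # total_seconds() returns a float; the nearest representable double to
> # 388629894454.999981 is 388629894455.0, so flooring it is already one
> # second too high.
> print(td.total_seconds())                # 388629894455.0
> print(math.floor(td.total_seconds()))    # 388629894455
>
> # The old conversion then adds the microseconds on top of the rounded-up
> # value, giving 388629894455999981 instead of 388629894454999981.
> print(math.floor(td.total_seconds()) * 1000000 + td.microseconds)
> {code}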
> The issue can be fixed by doing the conversion in a slightly different way. 
> Instead of:
>  
> {code:python}
> (math.floor(dt.total_seconds()) * 1000000) + dt.microseconds
> {code}
>  
> we compute the total number of microseconds directly from the timedelta's 
> integer components:
>  
> {code:python}
> (((dt.days * 86400) + dt.seconds) * 1_000_000) + dt.microseconds
> {code}
>  
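> For comparison, a quick sanity check of the component-based expression on the 
> same edge-case timedelta (plain Python; the variable names here are just for 
> illustration):
>
> {code:python}
> from datetime import timedelta
>
> td = timedelta(days=4498031, seconds=16054, microseconds=999981)
>
> # Everything stays in integer arithmetic, so no floating-point rounding
> # can creep in.
> micros = (((td.days * 86400) + td.seconds) * 1_000_000) + td.microseconds
> print(micros)                            # 388629894454999981
>
> # Round-trips back to the original timedelta exactly.
> assert timedelta(microseconds=micros) == td
> {code}
>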
> h1. Tests
> An illustrative edge case for timedeltas is the one above (which can also be 
> written as `datetime.timedelta(microseconds=388629894454999981)`).
>  
> A related edge case, which is already handled correctly but not covered by a 
> test, is a timedelta constructed from a mix of positive and negative 
> components. A test entry for this case is included as well, since it is 
> related.
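>
> A sketch of what assertions for these two edge cases could look like (the 
> `to_micros` helper here is illustrative, not the actual test added in the PR):
>
> {code:python}
> from datetime import timedelta
>
> def to_micros(td: timedelta) -> int:
>     # Component-based conversion described in the Fix section above.
>     return (((td.days * 86400) + td.seconds) * 1_000_000) + td.microseconds
>
> # Edge case 1: microsecond component close to 1_000_000 on a large timedelta.
> td1 = timedelta(days=4498031, seconds=16054, microseconds=999981)
> assert timedelta(microseconds=to_micros(td1)) == td1
>
> # Edge case 2: mixed positive and negative components.
> td2 = timedelta(days=1, seconds=-30, microseconds=-100)
> assert timedelta(microseconds=to_micros(td2)) == td2
> {code}
>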
> h1. PR
> Link to the PR addressing this issue: 
> https://github.com/apache/spark/pull/42541
> h1. Keywords to help people searching for this issue:
> datetime.timedelta
> timedelta
> pyspark.sql.types.DayTimeIntervalType
> DayTimeIntervalType
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
