[ https://issues.apache.org/jira/browse/SPARK-44854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-44854.
----------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
                   3.4.2
       Resolution: Fixed

Issue resolved by pull request 42541
[https://github.com/apache/spark/pull/42541]

> Python timedelta to DayTimeIntervalType edge cases bug
> ------------------------------------------------------
>
>                 Key: SPARK-44854
>                 URL: https://issues.apache.org/jira/browse/SPARK-44854
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ocean HD
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.0, 4.0.0, 3.4.2
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> h1. Python Timedelta to PySpark DayTimeIntervalType bug
>
> There is a bug where certain Python datetime.timedelta objects are converted to a PySpark DayTimeIntervalType column whose value differs from the value stored in the Python timedelta.
> A simple illustrative example can be produced with the code below:
>
> {code:java}
> from datetime import timedelta
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DayTimeIntervalType, StructField, StructType
>
> spark = SparkSession.builder.getOrCreate()
>
> td = timedelta(days=4498031, seconds=16054, microseconds=999981)
> df = spark.createDataFrame(
>     [(td,)],
>     StructType([StructField(name="timedelta_col", dataType=DayTimeIntervalType(), nullable=False)]),
> )
> df.show(truncate=False)
>
> +------------------------------------------------+
> |timedelta_col                                   |
> +------------------------------------------------+
> |INTERVAL '4498031 04:27:35.999981' DAY TO SECOND|
> +------------------------------------------------+
>
> print(str(td))
> '4498031 days, 4:27:34.999981' {code}
>
> In the above example, look at the seconds: the original Python timedelta object has 34 seconds, while the PySpark DayTimeIntervalType column has 35 seconds.
>
> h1. Fix
>
> This issue arises because the current conversion from a Python timedelta uses the timedelta method `.total_seconds()` to get the number of seconds, and then adds the microsecond component back in afterwards. For some timedeltas (large ones whose microsecond component is close to 1_000_000), the float returned by `.total_seconds()` cannot represent the value exactly and rounds *up* to the next whole second, and the microseconds are then added on top of that. The effect is that 1 second gets added incorrectly. (A standalone sketch of this rounding behaviour is included after the PR link below.)
> The issue can be fixed by doing the conversion in a slightly different way. Instead of:
>
> {code:java}
> (math.floor(dt.total_seconds()) * 1000000) + dt.microseconds{code}
>
> we compute the interval's microsecond value directly from the timedelta's integer components:
>
> {code:java}
> (((dt.days * 86400) + dt.seconds) * 1_000_000) + dt.microseconds {code}
>
> h1. Tests
>
> An illustrative edge case for timedeltas is the one above (which can also be written as `datetime.timedelta(microseconds=388629894454999981)`).
> A related edge case, already handled but not previously tested, occurs when the created timedelta object has both positive and negative components. A test entry for this edge case is included as well, since it is related.
>
> h1. PR
>
> Link to the PR addressing this issue:
> https://github.com/apache/spark/pull/42541
>
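> h1. Standalone illustration of the rounding (editor's sketch)
>
> The following is a minimal sketch, not part of the original report, that reproduces the rounding behaviour described in the Fix section using only the Python standard library (no Spark session needed). The two conversion expressions are the ones quoted above; the variable names `old_micros` and `new_micros` are illustrative only, and the printed values assume the usual IEEE-754 double behaviour of `.total_seconds()`.
>
> {code:java}
> import math
> from datetime import timedelta
>
> # The edge-case value from the report.
> td = timedelta(days=4498031, seconds=16054, microseconds=999981)
>
> # Old conversion (quoted in the Fix section): total_seconds() returns a float,
> # and for a timedelta this large the float cannot hold the fractional seconds
> # exactly, so flooring it lands one whole second too high.
> old_micros = (math.floor(td.total_seconds()) * 1_000_000) + td.microseconds
>
> # Fixed conversion (also quoted above): pure integer arithmetic on the
> # timedelta's own components, so no floating-point rounding is involved.
> # The names old_micros/new_micros are illustrative, not from the Spark source.
> new_micros = (((td.days * 86400) + td.seconds) * 1_000_000) + td.microseconds
>
> print(old_micros)               # 388629894455999981 -> one second too many
> print(new_micros)               # 388629894454999981 -> matches the timedelta
> print(old_micros - new_micros)  # 1000000 microseconds, i.e. exactly 1 second
> {code}
>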
> h1. Keywords to help people searching for this issue:
>
> datetime.timedelta
> timedelta
> pyspark.sql.types.DayTimeIntervalType
> DayTimeIntervalType
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org