[ https://issues.apache.org/jira/browse/SPARK-35515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351015#comment-17351015 ]
Martin Studer commented on SPARK-35515:
---------------------------------------

I'm happy to provide a PR if this seems like a sensible improvement.

> TimestampType: OverflowError: mktime argument out of range
> -----------------------------------------------------------
>
>                 Key: SPARK-35515
>                 URL: https://issues.apache.org/jira/browse/SPARK-35515
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Martin Studer
>            Priority: Minor
>
> This issue occurs, for example, when trying to create a data frame from
> Python {{datetime}} objects that are "out of range", where "out of range" is
> platform-dependent due to the use of
> [{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in
> {{TimestampType.toInternal}}:
> {code}
> import datetime
> spark_session.createDataFrame([(datetime.datetime(9999, 12, 31, 0, 0),)])
> {code}
> A more direct way to reproduce the issue is to invoke
> {{TimestampType.toInternal}} directly:
> {code}
> import datetime
> from pyspark.sql.types import TimestampType
> dt = datetime.datetime(9999, 12, 31, 0, 0)
> TimestampType().toInternal(dt)
> {code}
> The suggested improvement is to avoid {{time.mktime}} in order to increase
> the range of supported {{datetime}} values. A possible implementation may
> look as follows:
> {code}
> import datetime
> import pytz
>
> EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
> LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo
>
> def toInternal(dt):
>     if dt is not None:
>         dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
>         dt_utc = dt.astimezone(pytz.utc)
>         td = dt_utc - EPOCH_UTC
>         return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds
> {code}
> This relies on the ability to derive the local timezone. Mechanisms other
> than the one suggested above may be used for that purpose.
> Test cases include:
> {code}
> dt1 = datetime.datetime(2021, 5, 25, 12, 23)
> dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> dt3 = datetime.datetime(9999, 12, 31, 0, 0)
> dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> toInternal(dt1) == TimestampType().toInternal(dt1)
> toInternal(dt2) == TimestampType().toInternal(dt2)
> toInternal(dt3)  # TimestampType().toInternal(dt3) fails
> toInternal(dt4) == TimestampType().toInternal(dt4)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
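As a follow-up note: the timedelta-based conversion quoted above can be sketched as a self-contained, stdlib-only snippet. This is a hypothetical illustration, not Spark's actual implementation; it substitutes `datetime.timezone.utc` for `pytz.utc`, and the name `to_internal` (rather than `TimestampType.toInternal`) is illustrative only.

```python
import datetime

# Stand-alone sketch of the proposed mktime-free conversion (assumed names;
# the real method lives on pyspark.sql.types.TimestampType).
EPOCH_UTC = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo

def to_internal(dt):
    """Return dt as microseconds since the Unix epoch, or None."""
    if dt is None:
        return None
    # Interpret naive datetimes in the local timezone, as the proposal does.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=LOCAL_TZ)
    td = dt.astimezone(datetime.timezone.utc) - EPOCH_UTC
    return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds

# datetime.MAXYEAR converts without error, unlike the time.mktime-based
# implementation, where the supported range is platform-dependent:
print(to_internal(datetime.datetime(9999, 12, 31, tzinfo=datetime.timezone.utc)))
```

Because the arithmetic runs entirely on `timedelta` fields, the supported range is the full `datetime.MINYEAR`..`datetime.MAXYEAR` span rather than whatever the C library's `mktime` happens to accept.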