[
https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takeshi Yamamuro resolved SPARK-24752.
--------------------------------------
Resolution: Not A Problem
> date_format provides incorrect year after a timezone conversation changes the
> year on a timestamp
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-24752
> URL: https://issues.apache.org/jira/browse/SPARK-24752
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core, SQL
> Affects Versions: 2.1.2, 2.4.0
> Reporter: Noah Lidell
> Priority: Minor
>
> Initially I have a dataframe with a column for date and a column for time in
> UTC time. I combine these two columns into a single string that represents a
> UTC timestamp. I then convert that UTC timestamp into a Central Time
> timestamp using pyspark.sql.functions.from_utc_timestamp(). In the example
> below I have timestamps that are early in the day on Janurary 1st UTC, so
> that when they are converted to Central Time the resulting timestamp is on
> December 31st of the previous year. These timestamps are produced correctly.
> However, if I then use pyspark.sql.functions.date_format() to format a date
> from the one of these December 31st Central Time timestamps then the year
> portion of the output will be wrong.
>
>
> {code:java}
> import pyspark.sql.functions as F
> columns = ["observation_date","observation_time"]
> vals = [
> ('2015-01-01', '02:22'),
> ('2016-01-01', '02:22'),
> ('2016-01-01', '08:22'),
> ('2016-01-02', '02:22'),
> ('2017-01-01', '02:22'),
> ('2018-01-01', '02:22'),
> ('2019-01-01', '02:22'),
> ]
> test_df = spark.createDataFrame(vals, columns)
> test_df.show()
> # OUTPUT
> # +----------------+----------------+
> # |observation_date|observation_time|
> # +----------------+----------------+
> # | 2015-01-01| 02:22|
> # | 2016-01-01| 02:22|
> # | 2016-01-01| 08:22|
> # | 2016-01-02| 02:22|
> # | 2017-01-01| 02:22|
> # | 2018-01-01| 02:22|
> # | 2019-01-01| 02:22|
> # +----------------+----------------+
> renamed =
> test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")
> utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"),
> F.lit(" "), F.col("UTC_observation_time")))
> utc_ts.show()
> # OUTPUT
> # +--------------------+--------------------+----------------+
> # |UTC_observation_date|UTC_observation_time| UTC|
> # +--------------------+--------------------+----------------+
> # | 2015-01-01| 02:22|2015-01-01 02:22|
> # | 2016-01-01| 02:22|2016-01-01 02:22|
> # | 2016-01-01| 08:22|2016-01-01 08:22|
> # | 2016-01-02| 02:22|2016-01-02 02:22|
> # | 2017-01-01| 02:22|2017-01-01 02:22|
> # | 2018-01-01| 02:22|2018-01-01 02:22|
> # | 2019-01-01| 02:22|2019-01-01 02:22|
> # +--------------------+--------------------+----------------+
> central_ts = utc_ts.withColumn("central_timestamp",
> F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))
> final = central_ts.withColumn('observation_date',
> F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
> .withColumn('observation_time', F.date_format(central_ts.central_timestamp,
> "HH:mm"))
> final.select('observation_date','observation_time','central_timestamp','UTC').show()
> # OUTPUT
> # +----------------+----------------+--------------------+----------------+
> # |observation_date|observation_time| central_timestamp| UTC|
> # +----------------+----------------+--------------------+----------------+
> # | 2015-12-31| 20:22|2014-12-31 20:22:...|2015-01-01 02:22|
> # | 2016-12-31| 20:22|2015-12-31 20:22:...|2016-01-01 02:22|
> # | 2016-01-01| 02:22|2016-01-01 02:22:...|2016-01-01 08:22|
> # | 2016-01-01| 20:22|2016-01-01 20:22:...|2016-01-02 02:22|
> # | 2016-12-31| 20:22|2016-12-31 20:22:...|2017-01-01 02:22|
> # | 2018-12-31| 20:22|2017-12-31 20:22:...|2018-01-01 02:22|
> # | 2019-12-31| 20:22|2018-12-31 20:22:...|2019-01-01 02:22|
> #
> +----------------+----------------+--------------------+----------------+{code}
> The Central Time timestamp is generated correctly by the from_utc_timestamp
> function but when formatted by the date_format function _and if_ the timezone
> conversation caused a change in year then the formatted string produced by
> the date_format function will have the wrong year.
> In summary, this is the incorrect behavior:
> {code:java}
> UTC timestamp="2016-01-01 02:22"
> --(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22"
> --(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code}
> output instead should be "2015-12-31"
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]