[ https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536519#comment-16536519 ]
Takeshi Yamamuro commented on SPARK-24752: ------------------------------------------ I closed this cuz this is expected. > date_format provides incorrect year after a timezone conversation changes the > year on a timestamp > ------------------------------------------------------------------------------------------------- > > Key: SPARK-24752 > URL: https://issues.apache.org/jira/browse/SPARK-24752 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL > Affects Versions: 2.1.2, 2.4.0 > Reporter: Noah Lidell > Priority: Minor > > Initially I have a dataframe with a column for date and a column for time in > UTC time. I combine these two columns into a single string that represents a > UTC timestamp. I then convert that UTC timestamp into a Central Time > timestamp using pyspark.sql.functions.from_utc_timestamp(). In the example > below I have timestamps that are early in the day on Janurary 1st UTC, so > that when they are converted to Central Time the resulting timestamp is on > December 31st of the previous year. These timestamps are produced correctly. > However, if I then use pyspark.sql.functions.date_format() to format a date > from the one of these December 31st Central Time timestamps then the year > portion of the output will be wrong. > > > {code:java} > import pyspark.sql.functions as F > columns = ["observation_date","observation_time"] > vals = [ > ('2015-01-01', '02:22'), > ('2016-01-01', '02:22'), > ('2016-01-01', '08:22'), > ('2016-01-02', '02:22'), > ('2017-01-01', '02:22'), > ('2018-01-01', '02:22'), > ('2019-01-01', '02:22'), > ] > test_df = spark.createDataFrame(vals, columns) > test_df.show() > # OUTPUT > # +----------------+----------------+ > # |observation_date|observation_time| > # +----------------+----------------+ > # | 2015-01-01| 02:22| > # | 2016-01-01| 02:22| > # | 2016-01-01| 08:22| > # | 2016-01-02| 02:22| > # | 2017-01-01| 02:22| > # | 2018-01-01| 02:22| > # | 2019-01-01| 02:22| > # +----------------+----------------+ > renamed = > test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time") > utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), > F.lit(" "), F.col("UTC_observation_time"))) > utc_ts.show() > # OUTPUT > # +--------------------+--------------------+----------------+ > # |UTC_observation_date|UTC_observation_time| UTC| > # +--------------------+--------------------+----------------+ > # | 2015-01-01| 02:22|2015-01-01 02:22| > # | 2016-01-01| 02:22|2016-01-01 02:22| > # | 2016-01-01| 08:22|2016-01-01 08:22| > # | 2016-01-02| 02:22|2016-01-02 02:22| > # | 2017-01-01| 02:22|2017-01-01 02:22| > # | 2018-01-01| 02:22|2018-01-01 02:22| > # | 2019-01-01| 02:22|2019-01-01 02:22| > # +--------------------+--------------------+----------------+ > central_ts = utc_ts.withColumn("central_timestamp", > F.from_utc_timestamp(utc_ts.UTC, 'US/Central')) > final = central_ts.withColumn('observation_date', > F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\ > .withColumn('observation_time', F.date_format(central_ts.central_timestamp, > "HH:mm")) > final.select('observation_date','observation_time','central_timestamp','UTC').show() > # OUTPUT > # +----------------+----------------+--------------------+----------------+ > # |observation_date|observation_time| central_timestamp| UTC| > # +----------------+----------------+--------------------+----------------+ > # | 2015-12-31| 20:22|2014-12-31 20:22:...|2015-01-01 02:22| > # | 2016-12-31| 20:22|2015-12-31 20:22:...|2016-01-01 02:22| > # | 2016-01-01| 02:22|2016-01-01 02:22:...|2016-01-01 08:22| > # | 2016-01-01| 20:22|2016-01-01 20:22:...|2016-01-02 02:22| > # | 2016-12-31| 20:22|2016-12-31 20:22:...|2017-01-01 02:22| > # | 2018-12-31| 20:22|2017-12-31 20:22:...|2018-01-01 02:22| > # | 2019-12-31| 20:22|2018-12-31 20:22:...|2019-01-01 02:22| > # > +----------------+----------------+--------------------+----------------+{code} > The Central Time timestamp is generated correctly by the from_utc_timestamp > function but when formatted by the date_format function _and if_ the timezone > conversation caused a change in year then the formatted string produced by > the date_format function will have the wrong year. > In summary, this is the incorrect behavior: > {code:java} > UTC timestamp="2016-01-01 02:22" > --(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22" > --(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code} > output instead should be "2015-12-31" > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org