[ 
https://issues.apache.org/jira/browse/SPARK-24752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536519#comment-16536519
 ] 

Takeshi Yamamuro commented on SPARK-24752:
------------------------------------------

I closed this cuz this is expected.

> date_format provides incorrect year after a timezone conversation changes the 
> year on a timestamp
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24752
>                 URL: https://issues.apache.org/jira/browse/SPARK-24752
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.1.2, 2.4.0
>            Reporter: Noah Lidell
>            Priority: Minor
>
> Initially I have a dataframe with a column for date and a column for time in 
> UTC time. I combine these two columns into a single string that represents a 
> UTC timestamp. I then convert that UTC timestamp into a Central Time 
> timestamp using pyspark.sql.functions.from_utc_timestamp(). In the example 
> below I have timestamps that are early in the day on Janurary 1st UTC, so 
> that when they are converted to Central Time the resulting timestamp is on 
> December 31st of the previous year. These timestamps are produced correctly. 
> However, if I then use pyspark.sql.functions.date_format() to format a date 
> from the one of these December 31st Central Time timestamps then the year 
> portion of the output will be wrong. 
>  
>  
> {code:java}
> import pyspark.sql.functions as F
> columns = ["observation_date","observation_time"]
> vals = [
> ('2015-01-01', '02:22'),
> ('2016-01-01', '02:22'),
> ('2016-01-01', '08:22'),
> ('2016-01-02', '02:22'),
> ('2017-01-01', '02:22'),
> ('2018-01-01', '02:22'),
> ('2019-01-01', '02:22'),
> ]
> test_df = spark.createDataFrame(vals, columns)
> test_df.show()
> # OUTPUT
> # +----------------+----------------+
> # |observation_date|observation_time|
> # +----------------+----------------+
> # |      2015-01-01|           02:22|
> # |      2016-01-01|           02:22|
> # |      2016-01-01|           08:22|
> # |      2016-01-02|           02:22|
> # |      2017-01-01|           02:22|
> # |      2018-01-01|           02:22|
> # |      2019-01-01|           02:22|
> # +----------------+----------------+
> renamed = 
> test_df.withColumnRenamed("observation_date","UTC_observation_date").withColumnRenamed("observation_time","UTC_observation_time")
> utc_ts = renamed.withColumn("UTC", F.concat(F.col("UTC_observation_date"), 
> F.lit(" "), F.col("UTC_observation_time")))
> utc_ts.show()
> # OUTPUT
> # +--------------------+--------------------+----------------+
> # |UTC_observation_date|UTC_observation_time|             UTC|
> # +--------------------+--------------------+----------------+
> # |          2015-01-01|               02:22|2015-01-01 02:22|
> # |          2016-01-01|               02:22|2016-01-01 02:22|
> # |          2016-01-01|               08:22|2016-01-01 08:22|
> # |          2016-01-02|               02:22|2016-01-02 02:22|
> # |          2017-01-01|               02:22|2017-01-01 02:22|
> # |          2018-01-01|               02:22|2018-01-01 02:22|
> # |          2019-01-01|               02:22|2019-01-01 02:22|
> # +--------------------+--------------------+----------------+
> central_ts = utc_ts.withColumn("central_timestamp", 
> F.from_utc_timestamp(utc_ts.UTC, 'US/Central'))
> final = central_ts.withColumn('observation_date', 
> F.date_format(central_ts.central_timestamp, "YYYY-MM-dd"))\
> .withColumn('observation_time', F.date_format(central_ts.central_timestamp, 
> "HH:mm"))
> final.select('observation_date','observation_time','central_timestamp','UTC').show()
> # OUTPUT
> # +----------------+----------------+--------------------+----------------+
> # |observation_date|observation_time|   central_timestamp|             UTC|
> # +----------------+----------------+--------------------+----------------+
> # |      2015-12-31|           20:22|2014-12-31 20:22:...|2015-01-01 02:22|
> # |      2016-12-31|           20:22|2015-12-31 20:22:...|2016-01-01 02:22|
> # |      2016-01-01|           02:22|2016-01-01 02:22:...|2016-01-01 08:22|
> # |      2016-01-01|           20:22|2016-01-01 20:22:...|2016-01-02 02:22|
> # |      2016-12-31|           20:22|2016-12-31 20:22:...|2017-01-01 02:22|
> # |      2018-12-31|           20:22|2017-12-31 20:22:...|2018-01-01 02:22|
> # |      2019-12-31|           20:22|2018-12-31 20:22:...|2019-01-01 02:22|
> # 
> +----------------+----------------+--------------------+----------------+{code}
> The Central Time timestamp is generated correctly by the from_utc_timestamp 
> function but when formatted by the date_format function _and if_ the timezone 
> conversation caused a change in year then the formatted string produced by 
> the date_format function will have the wrong year.
> In summary, this is the incorrect behavior:
> {code:java}
> UTC timestamp="2016-01-01 02:22" 
> --(apply from_utc_timestamp)--> Central timestamp="2015-12-31 20:22" 
> --(apply date_format "YYYY-MM-dd")--> date="2016-12-31"{code}
> output instead should be "2015-12-31"
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to