[
https://issues.apache.org/jira/browse/SPARK-27450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314093#comment-17314093
]
Will Zimmerman commented on SPARK-27450:
----------------------------------------
[~maxgekk] the proposed solution handles one specific use case, but it
doesn't let a higher level of time specificity through. The issue is that a
valid [ISO 8601|https://en.wikipedia.org/wiki/ISO_8601#Times] format cannot
be interpreted by the built-in
[timestamp/to_timestamp|https://spark.apache.org/docs/latest/api/sql/index.html#to_timestamp]
Spark function. As the example below shows, the presence of a time zone
designator causes the conversion to return null.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30') AS timestamp_2
""").show(truncate=False)
# Result
+-----------+-------------------+
|timestamp_1|timestamp_2 |
+-----------+-------------------+
|null |2020-05-19 10:30:00|
+-----------+-------------------+{code}
The proposed workaround accepts the format you described above, but it
applies only to that exact format and not to the other valid ISO 8601
variants.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mmX') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mmX') AS timestamp_2
,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mmX') AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-----------+-----------+
|timestamp_1 |timestamp_2|timestamp_3|
+-------------------+-----------+-----------+
|2020-05-19 10:30:00|null |null |
+-------------------+-----------+-----------+
{code}
The other option I tried, shown below, ends up cutting off any time
granularity beyond the specified format. That behavior makes sense, but it
is problematic for data with multiple degrees of time granularity.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mm') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mm') AS timestamp_2
,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mm') AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-------------------+-------------------+
|timestamp_1 |timestamp_2 |timestamp_3 |
+-------------------+-------------------+-------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:00|2020-05-19 10:30:00|
+-------------------+-------------------+-------------------+
{code}
The only workaround I've found that works for all of the ISO 8601 formats
is the following.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP(REPLACE('2020-05-19 10:30Z', 'Z', '')) AS timestamp_1
,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19Z', 'Z', '')) AS timestamp_2
,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19.001Z', 'Z', '')) AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-------------------+-----------------------+
|timestamp_1 |timestamp_2 |timestamp_3 |
+-------------------+-------------------+-----------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:19|2020-05-19 10:30:19.001|
+-------------------+-------------------+-----------------------+
{code}
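For inputs that carry a numeric offset rather than a literal Z (e.g.
'2017-08-01T02:33-03:00'), the REPLACE trick above can be generalized by
stripping the designator with a regular expression (e.g. via Spark's
regexp_replace) before parsing. A minimal sketch of the normalization in
plain Python; the pattern and helper name are mine, not Spark's, and note
that, like the REPLACE workaround, this silently discards the zone
information:

```python
import re

# Matches a trailing UTC designator or numeric offset:
# 'Z', '+05:30', '-03:00', or '-0300'.
OFFSET_RE = re.compile(r"(Z|[+-]\d{2}:?\d{2})$")

def strip_offset(ts: str) -> str:
    """Drop a trailing zone designator so TO_TIMESTAMP can parse the rest.

    WARNING: the zone information is discarded; the remaining local time
    is then interpreted in the session time zone.
    """
    return OFFSET_RE.sub("", ts)
```

In Spark SQL the equivalent would be something like
TO_TIMESTAMP(REGEXP_REPLACE(col, '(Z|[+-]\\d\\d:?\\d\\d)$', '')).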
Would it be possible to add the "yyyy-MM-dd HH:mmX" format to the set of
formats the built-in Spark function already accepts?
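Until something like that is supported natively, an interim workaround
could be a COALESCE over several explicit formats, e.g.
COALESCE(TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mm:ss.SSSX'),
TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mm:ssX'),
TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mmX')), assuming the non-matching formats
return null rather than raise, as in the examples above. The same
fallback-chain idea sketched in plain Python for illustration (format list
and function name are mine):

```python
from datetime import datetime

# Most-specific format first; fall through to coarser granularities.
# Python's %z accepts 'Z' as well as numeric offsets (3.7+).
FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f%z",
    "%Y-%m-%d %H:%M:%S%z",
    "%Y-%m-%d %H:%M%z",
]

def parse_iso(s):
    """Try each format in turn; return None if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    return None
```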
> Timestamp cast fails when the ISO8601 string omits minutes, seconds or
> milliseconds
> -----------------------------------------------------------------------------------
>
> Key: SPARK-27450
> URL: https://issues.apache.org/jira/browse/SPARK-27450
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: Spark 2.3.x
> Reporter: Leandro Rosa
> Priority: Major
>
> ISO8601 allows minutes, seconds and milliseconds to be omitted.
> {quote}
> |hh:mm:ss.sss|_or_|hhmmss.sss|
> |hh:mm:ss|_or_|hhmmss|
> |hh:mm|_or_|hhmm|
> | |hh|
> {quote}
> {quote}Either the seconds, or the minutes and seconds, may be omitted from
> the basic or extended time formats for greater brevity but decreased
> accuracy: [hh]:[mm], [hh][mm] and [hh] are the resulting reduced accuracy
> time formats
> {quote}
> Source: [Wikipedia ISO8601|https://en.wikipedia.org/wiki/ISO_8601]
> Popular libs, such as
> [ZonedDateTime|https://docs.oracle.com/javase/8/docs/api/java/time/ZonedDateTime.html],
> respect that. However, Timestamp cast fails silently.
>
> {code:java}
> import org.apache.spark.sql.types._
> val df1 = Seq(("2017-08-01T02:33")).toDF("eventTimeString") // NON-ISO8601 (missing TZ offset) [OK]
> val new_df1 = df1
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df1.show(false)
> +----------------+-------------------+
> |eventTimeString |eventTimeTS |
> +----------------+-------------------+
> |2017-08-01T02:33|2017-08-01 02:33:00|
> +----------------+-------------------+
> {code}
> {code:java}
> val df2 = Seq(("2017-08-01T02:33Z")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df2 = df2
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df2.show(false)
> +-----------------+-----------+
> |eventTimeString |eventTimeTS|
> +-----------------+-----------+
> |2017-08-01T02:33Z|null |
> +-----------------+-----------+
> {code}
>
> {code:java}
> val df3 = Seq(("2017-08-01T02:33-03:00")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df3 = df3
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df3.show(false)
> +----------------------+-----------+
> |eventTimeString |eventTimeTS|
> +----------------------+-----------+
> |2017-08-01T02:33-03:00|null |
> +----------------------+-----------+
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)