[
https://issues.apache.org/jira/browse/SPARK-27450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314093#comment-17314093
]
Will Zimmerman commented on SPARK-27450:
----------------------------------------
[~maxgekk] the proposed solution handles one specific use case, but it
doesn't let a higher level of time specificity through. The issue is that a
valid [ISO 8601|https://en.wikipedia.org/wiki/ISO_8601#Times] format cannot
be interpreted by the built-in
[timestamp/to_timestamp|https://spark.apache.org/docs/latest/api/sql/index.html#to_timestamp]
Spark function. As the example below shows, the presence of a time zone
designator causes the conversion to return null.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30') AS timestamp_2
""").show(truncate=False)
# Result
+-----------+-------------------+
|timestamp_1|timestamp_2 |
+-----------+-------------------+
|null |2020-05-19 10:30:00|
+-----------+-------------------+{code}
The proposed workaround accepts the format you described above, but it
applies only to that exact format and not to the other valid ISO 8601
variants.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mmX') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mmX') AS timestamp_2
,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mmX') AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-----------+-----------+
|timestamp_1 |timestamp_2|timestamp_3|
+-------------------+-----------+-----------+
|2020-05-19 10:30:00|null |null |
+-------------------+-----------+-----------+
{code}
The other option I tried, shown below, ends up cutting off any time
granularity beyond the specified format. That behavior makes sense, but it
is problematic for data with multiple degrees of time granularity.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP('2020-05-19 10:30Z', 'yyyy-MM-dd HH:mm') AS timestamp_1
,TO_TIMESTAMP('2020-05-19 10:30:19Z', 'yyyy-MM-dd HH:mm') AS timestamp_2
,TO_TIMESTAMP('2020-05-19 10:30:19.001Z', 'yyyy-MM-dd HH:mm') AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-------------------+-------------------+
|timestamp_1 |timestamp_2 |timestamp_3 |
+-------------------+-------------------+-------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:00|2020-05-19 10:30:00|
+-------------------+-------------------+-------------------+
{code}
The only workaround I've found that works for all of the ISO 8601 formats
is the following.
{code:java}
spark.sql("""
SELECT
TO_TIMESTAMP(REPLACE('2020-05-19 10:30Z', 'Z', '')) AS timestamp_1
,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19Z', 'Z', '')) AS timestamp_2
,TO_TIMESTAMP(REPLACE('2020-05-19 10:30:19.001Z', 'Z', '')) AS timestamp_3
""").show(truncate=False)
# Result
+-------------------+-------------------+-----------------------+
|timestamp_1 |timestamp_2 |timestamp_3 |
+-------------------+-------------------+-----------------------+
|2020-05-19 10:30:00|2020-05-19 10:30:19|2020-05-19 10:30:19.001|
+-------------------+-------------------+-----------------------+
{code}
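For inputs that carry a numeric offset rather than a literal Z (e.g.
'2017-08-01T02:33-03:00'), the REPLACE trick above can be generalized by
stripping the designator with a regular expression (e.g. via Spark's
regexp_replace) before parsing. A minimal sketch of the normalization in
plain Python; the pattern and helper name are mine, not Spark's, and note
that, like the REPLACE workaround, this silently discards the zone
information:

```python
import re

# Matches a trailing UTC designator or numeric offset:
# 'Z', '+05:30', '-03:00', or '-0300'.
OFFSET_RE = re.compile(r"(Z|[+-]\d{2}:?\d{2})$")

def strip_offset(ts: str) -> str:
    """Drop a trailing zone designator so TO_TIMESTAMP can parse the rest.

    WARNING: the zone information is discarded; the remaining local time
    is then interpreted in the session time zone.
    """
    return OFFSET_RE.sub("", ts)
```

In Spark SQL the equivalent would be something like
TO_TIMESTAMP(REGEXP_REPLACE(col, '(Z|[+-]\\d\\d:?\\d\\d)$', '')).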
Would it be possible to add the "yyyy-MM-dd HH:mmX" format to the set of
formats the built-in Spark function already accepts?
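Until something like that is supported natively, an interim workaround
could be a COALESCE over several explicit formats, e.g.
COALESCE(TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mm:ss.SSSX'),
TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mm:ssX'),
TO_TIMESTAMP(col, 'yyyy-MM-dd HH:mmX')), assuming the non-matching formats
return null rather than raise, as in the examples above. The same
fallback-chain idea sketched in plain Python for illustration (format list
and function name are mine):

```python
from datetime import datetime

# Most-specific format first; fall through to coarser granularities.
# Python's %z accepts 'Z' as well as numeric offsets (3.7+).
FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f%z",
    "%Y-%m-%d %H:%M:%S%z",
    "%Y-%m-%d %H:%M%z",
]

def parse_iso(s):
    """Try each format in turn; return None if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    return None
```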
> Timestamp cast fails when the ISO8601 string omits minutes, seconds or
> milliseconds
> -----------------------------------------------------------------------------------
>
> Key: SPARK-27450
> URL: https://issues.apache.org/jira/browse/SPARK-27450
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: Spark 2.3.x
> Reporter: Leandro Rosa
> Priority: Major
>
> ISO8601 allows minutes, seconds and milliseconds to be omitted.
> {quote}
> |hh:mm:ss.sss|_or_|hhmmss.sss|
> |hh:mm:ss|_or_|hhmmss|
> |hh:mm|_or_|hhmm|
> | |hh|
> {quote}
> {quote}Either the seconds, or the minutes and seconds, may be omitted from
> the basic or extended time formats for greater brevity but decreased
> accuracy: [hh]:[mm], [hh][mm] and [hh] are the resulting reduced accuracy
> time formats
> {quote}
> Source: [Wikipedia ISO8601|https://en.wikipedia.org/wiki/ISO_8601]
> Popular libs, such as
> [ZonedDateTime|https://docs.oracle.com/javase/8/docs/api/java/time/ZonedDateTime.html],
> respect that. However, Timestamp cast fails silently.
>
> {code:java}
> import org.apache.spark.sql.types._
> val df1 = Seq(("2017-08-01T02:33")).toDF("eventTimeString") // NON-ISO8601 (missing TZ offset) [OK]
> val new_df1 = df1
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df1.show(false)
> +----------------+-------------------+
> |eventTimeString |eventTimeTS |
> +----------------+-------------------+
> |2017-08-01T02:33|2017-08-01 02:33:00|
> +----------------+-------------------+
> {code}
> {code:java}
> val df2 = Seq(("2017-08-01T02:33Z")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df2 = df2
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df2.show(false)
> +-----------------+-----------+
> |eventTimeString |eventTimeTS|
> +-----------------+-----------+
> |2017-08-01T02:33Z|null |
> +-----------------+-----------+
> {code}
>
> {code:java}
> val df3 = Seq(("2017-08-01T02:33-03:00")).toDF("eventTimeString") // ISO8601 [FAIL]
> val new_df3 = df3
> .withColumn("eventTimeTS", col("eventTimeString").cast(TimestampType))
> new_df3.show(false)
> +----------------------+-----------+
> |eventTimeString |eventTimeTS|
> +----------------------+-----------+
> |2017-08-01T02:33-03:00|null |
> +----------------------+-----------+
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)