[
https://issues.apache.org/jira/browse/SPARK-38577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508269#comment-17508269
]
Robert Joseph Evans commented on SPARK-38577:
---------------------------------------------
This is especially problematic because the behavior is inconsistent.
{code:scala}
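import java.time.Duration
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DayTimeIntervalType, StructField, StructType}
// DAY-only interval column, but the input Durations also carry seconds/minutes
val data = Seq(Row(Duration.ofDays(1).plusSeconds(1)),
  Row(Duration.ofDays(2).plusMinutes(2)))
val schema = StructType(Array(StructField("dur",
  DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|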
+----------------+---+-------------------+
|INTERVAL '1' DAY| 1|1970-01-02 00:00:01|
|INTERVAL '2' DAY| 2|1970-01-03 00:02:00|
+----------------+---+-------------------+
df.select(col("dur"),
  col("dur").cast(DayTimeIntervalType()).alias("default_dur")).show(truncate = false)
+----------------+-----------------------------------+
|dur             |default_dur                        |
+----------------+-----------------------------------+
|INTERVAL '1' DAY|INTERVAL '1 00:00:01' DAY TO SECOND|
|INTERVAL '2' DAY|INTERVAL '2 00:02:00' DAY TO SECOND|
+----------------+-----------------------------------+
{code}
Casting a value to a type with lower precision truncates it, but casting to a
type with higher precision or doing arithmetic with it does not.
Writing the data to Parquet preserves it exactly as it was input, but writing
it to CSV truncates it.
{code:scala}
df.write.parquet("./tmp")
val df2 = spark.read.parquet("./tmp")
df2.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|
+----------------+---+-------------------+
|INTERVAL '1' DAY| 1|1970-01-02 00:00:01|
|INTERVAL '2' DAY| 2|1970-01-03 00:02:00|
+----------------+---+-------------------+
df.write.csv("./tmp_csv")
val df3 = spark.read.schema(schema).csv("./tmp_csv")
df3.selectExpr("dur", "CAST(dur AS long)",
  "CAST('1970-1-1' as timestamp) + dur as ts").show()
+----------------+---+-------------------+
|             dur|dur|                 ts|
+----------------+---+-------------------+
|INTERVAL '2' DAY| 2|1970-01-03 00:00:00|
|INTERVAL '1' DAY| 1|1970-01-02 00:00:00|
+----------------+---+-------------------+
{code}
All of this is also true in the Python API.
I would expect to get an error when importing the data, or have Spark
truncate/fix the data when it is imported so I don't get inconsistent and
confusing results with it.
If this is the intended behavior, then I would like the documentation to
explain more clearly what is happening.
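As a rough sketch of what "truncate the data when it is imported" could mean, the helper below drops everything finer than the end field before producing the microsecond count. The name truncatedMicros is hypothetical and is not part of Spark; it uses only java.time, and representing the interval's end field as a ChronoUnit is an assumption made for illustration.

```scala
import java.time.Duration
import java.time.temporal.ChronoUnit

// Hypothetical helper, NOT a Spark API: illustrates truncating a Duration
// to a coarser end field before storing it as a microsecond count.
object TruncateSketch {
  def truncatedMicros(d: Duration, endUnit: ChronoUnit): Long = {
    // Size of one end-field unit in microseconds (e.g. 86400 * 1000000 for DAYS).
    val unitMicros = endUnit.getDuration.toNanos / 1000L
    // Full-precision microsecond count of the input Duration.
    val micros = Math.addExact(
      Math.multiplyExact(d.getSeconds, 1000000L), d.getNano / 1000L)
    // Drop the remainder below the end field (truncates toward zero).
    micros - micros % unitMicros
  }
}

val oneDayOneSecond = Duration.ofDays(1).plusSeconds(1)
println(TruncateSketch.truncatedMicros(oneDayOneSecond, ChronoUnit.DAYS))
println(TruncateSketch.truncatedMicros(oneDayOneSecond, ChronoUnit.SECONDS))
```

With DAYS as the end field, the extra second is dropped and the stored value is 86400 * 1000000; with SECONDS it is kept, giving (86400 + 1) * 1000000.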
> Interval types are not truncated to the expected endField when creating a
> DataFrame via Duration
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-38577
> URL: https://issues.apache.org/jira/browse/SPARK-38577
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot version
>
> Reporter: chong
> Priority: Major
>
> *Problem:*
> ANSI interval types are stored as a long internally.
> The long value is not truncated to the expected endField when creating a
> DataFrame via Duration.
>
> *Reproduce:*
> Create a "day to day" interval; the seconds are not truncated, as the code
> below shows. The internal long is not {{86400 * 1000000}}, but
> {{(86400 + 1) * 1000000}}.
>
> {code:java}
> test("my test") {
>   val data = Seq(Row(Duration.ofDays(1).plusSeconds(1)))
>   val schema = StructType(Array(
>     StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY,
>       DayTimeIntervalType.DAY))
>   ))
>   val df = spark.createDataFrame(spark.sparkContext.parallelize(data),
>     schema)
>   df.show()
> } {code}
>
>
> After debugging, I found that the {{endField}} is always {{SECOND}} in
> {{durationToMicros}}; see below:
>
> {code:java}
> // IntervalUtils class
> def durationToMicros(duration: Duration): Long = {
>   durationToMicros(duration, DT.SECOND) // always SECOND
> }
> def durationToMicros(duration: Duration, endField: Byte)
> {code}
> It seems a different endField should be used here, which could be one of
> [DAY, HOUR, MINUTE, SECOND]. Alternatively, Spark could throw an exception
> to avoid truncating.