[ https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362910#comment-17362910 ]

Robert Joseph Evans commented on SPARK-35563:
---------------------------------------------

[~dc-heros] Thanks for looking into this. I was trying to understand how `row_number` being an int could cause data loss; I don't care about the result of `row_number` itself overflowing. As you said, that does not feel like a bug to me. But it pointed me in the right direction. At first I thought there must be an issue with the buffer, but as I looked more closely I found a variable that does essentially the same thing as `row_number`:

https://github.com/apache/spark/blob/439e94c1712366ff267183d3946f2507ebf3a98e/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L179

It is an int and is overflowing, causing the iterator to terminate the loop early. I think that if we switch `rowNumber` to be a long, the data loss problem will go away. But it is not a super trivial change, because `WindowFunctionFrame` takes that variable as input to `write`, so we need to decide whether we would rather blow up when `rowNumber` overflows, or change the index to be a long and pull on that string until everything is updated.

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-35563
>                 URL: https://issues.apache.org/jira/browse/SPARK-35563
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure
> because it takes a long time to test. As a part of doing corner-case
> validation testing for spark rapids I found that if a window function has
> more than {{Int.MaxValue + 1}} rows the result is silently truncated to that
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I
> suspect it will impact others as well. This is a really rare corner case, but
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")
> spark.time(df.select(col("a"), col("b"),
>   row_number().over(windowSpec).alias("rn")).orderBy(desc("a"),
>   desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
>
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
>
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64 GiB of heap for the
> executor (I did it in local mode and it worked, but took forever to run).
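
For illustration, here is a minimal, self-contained Scala sketch of the silent int wraparound the comment above describes, and of how holding the index in a long avoids it. This is not the actual WindowExec code: the object name, the `i >= 0` loop guard, and the local variables are assumed stand-ins for whatever comparison actually stops the real iterator.
{code:scala}
object RowNumberOverflowSketch extends App {
  // JVM Int arithmetic wraps silently: Int.MaxValue + 1 == Int.MinValue.
  var rowNumber: Int = Int.MaxValue
  rowNumber += 1
  println(rowNumber) // -2147483648, with no error or warning

  // A loop guarded by a condition that assumes the counter only grows
  // (here a hypothetical `i >= 0`) exits as soon as the wrap happens,
  // which is one way an iterator can silently drop the remaining rows.
  var i: Int = Int.MaxValue - 2
  var produced = 0L
  while (i >= 0) { // terminates "early" once i wraps negative
    produced += 1
    i += 1
  }
  println(produced) // 3: anything past the wrap point is never produced

  // The direction the comment suggests: track the index as a Long,
  // which cannot wrap for any realistic row count.
  var rowIndex: Long = Int.MaxValue.toLong
  rowIndex += 1
  println(rowIndex) // 2147483648, as expected
}
{code}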
Robert Joseph Evans commented on SPARK-35563: --------------------------------------------- [~dc-heros] Thanks for looking into this. I was trying to understand how `row_number` being an int would cause data loss, I don't care about the overflow of the result for `row_number`. That does not feel like a bug to me, as you said. But it pointed me in the right direction. At first I thought that there must be an issue with the buffer, but as I looked more closely I found a variable that does essentially the same thing as `row_number`. https://github.com/apache/spark/blob/439e94c1712366ff267183d3946f2507ebf3a98e/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L179 It is an int and is overflowing causing the iterator to terminate the loop early. I think if we switch `rowNumber` to be a long the data loss problem will go away. But it is not a super trivial change, because `WindowFunctionFrame` takes that variable as input to `write` so we need to decide if we would rather blow up when rowNumber overflows, or if we should change the index to be a long and pull on that string until everything is updated. > [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows > ------------------------------------------------------------------------------ > > Key: SPARK-35563 > URL: https://issues.apache.org/jira/browse/SPARK-35563 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.2 > Reporter: Robert Joseph Evans > Priority: Blocker > Labels: data-loss > > I think this impacts a lot more versions of Spark, but I don't know for sure > because it takes a long time to test. As a part of doing corner case > validation testing for spark rapids I found that if a window function has > more than {{Int.MaxValue + 1}} rows the result is silently truncated to that > many rows. I have only tested this on 3.0.2 with {{row_number}}, but I > suspect it will impact others as well. This is a really rare corner case, but > because it is silent data corruption I personally think it is quite serious. > {code:scala} > import org.apache.spark.sql.expressions.Window > val windowSpec = Window.partitionBy("a").orderBy("b") > val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as > b") > spark.time(df.select(col("a"), col("b"), > row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), > desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20)) > +-----+----------+ > > | dir| count| > +-----+----------+ > |false|2147483647| > | true| 1| > +-----+----------+ > Time taken: 1139089 ms > Int.MaxValue.toLong + 100 > res15: Long = 2147483747 > 2147483647L + 1 > res16: Long = 2147483648 > {code} > I had to make sure that I ran the above with at least 64GiB of heap for the > executor (I did it in local mode and it worked, but took forever to run) -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org