[ https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362910#comment-17362910 ]

Robert Joseph Evans commented on SPARK-35563:
---------------------------------------------

[~dc-heros] Thanks for looking into this. I was trying to understand how `row_number` being an int could cause data loss; I don't care about the result of `row_number` itself overflowing. As you said, that does not feel like a bug to me. But it pointed me in the right direction. At first I thought there must be an issue with the buffer, but as I looked more closely I found a variable that does essentially the same thing as `row_number`:

https://github.com/apache/spark/blob/439e94c1712366ff267183d3946f2507ebf3a98e/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L179

It is an int and is overflowing, causing the iterator to terminate the loop early. I think that if we switch `rowNumber` to be a long, the data loss problem will go away. But it is not a super trivial change, because `WindowFunctionFrame` takes that variable as input to `write`, so we need to decide whether we would rather blow up when `rowNumber` overflows, or change the index to be a long and pull on that string until everything is updated.

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-35563
>                 URL: https://issues.apache.org/jira/browse/SPARK-35563
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure
> because it takes a long time to test. As a part of doing corner-case
> validation testing for spark rapids I found that if a window function has
> more than {{Int.MaxValue + 1}} rows the result is silently truncated to that
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I
> suspect it will impact others as well. This is a really rare corner case, but
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")
> spark.time(df.select(col("a"), col("b"),
>   row_number().over(windowSpec).alias("rn")).orderBy(desc("a"),
>   desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
>
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
>
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64 GiB of heap for the
> executor (I did it in local mode and it worked, but took forever to run).
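
For illustration, here is a minimal, self-contained Scala sketch of the silent int wraparound the comment above describes, and of how holding the index in a long avoids it. This is not the actual WindowExec code: the object name, the `i >= 0` loop guard, and the local variables are assumed stand-ins for whatever comparison actually stops the real iterator.
{code:scala}
object RowNumberOverflowSketch extends App {
  // JVM Int arithmetic wraps silently: Int.MaxValue + 1 == Int.MinValue.
  var rowNumber: Int = Int.MaxValue
  rowNumber += 1
  println(rowNumber) // -2147483648, with no error or warning

  // A loop guarded by a condition that assumes the counter only grows
  // (here a hypothetical `i >= 0`) exits as soon as the wrap happens,
  // which is one way an iterator can silently drop the remaining rows.
  var i: Int = Int.MaxValue - 2
  var produced = 0L
  while (i >= 0) { // terminates "early" once i wraps negative
    produced += 1
    i += 1
  }
  println(produced) // 3: anything past the wrap point is never produced

  // The direction the comment suggests: track the index as a Long,
  // which cannot wrap for any realistic row count.
  var rowIndex: Long = Int.MaxValue.toLong
  rowIndex += 1
  println(rowIndex) // 2147483648, as expected
}
{code}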
Robert Joseph Evans commented on SPARK-35563: --------------------------------------------- [~dc-heros] Thanks for looking into this. I was trying to understand how `row_number` being an int would cause data loss, I don't care about the overflow of the result for `row_number`. That does not feel like a bug to me, as you said. But it pointed me in the right direction. At first I thought that there must be an issue with the buffer, but as I looked more closely I found a variable that does essentially the same thing as `row_number`. https://github.com/apache/spark/blob/439e94c1712366ff267183d3946f2507ebf3a98e/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L179 It is an int and is overflowing causing the iterator to terminate the loop early. I think if we switch `rowNumber` to be a long the data loss problem will go away. But it is not a super trivial change, because `WindowFunctionFrame` takes that variable as input to `write` so we need to decide if we would rather blow up when rowNumber overflows, or if we should change the index to be a long and pull on that string until everything is updated. > [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows > ------------------------------------------------------------------------------ > > Key: SPARK-35563 > URL: https://issues.apache.org/jira/browse/SPARK-35563 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.2 > Reporter: Robert Joseph Evans > Priority: Blocker > Labels: data-loss > > I think this impacts a lot more versions of Spark, but I don't know for sure > because it takes a long time to test. As a part of doing corner case > validation testing for spark rapids I found that if a window function has > more than {{Int.MaxValue + 1}} rows the result is silently truncated to that > many rows. I have only tested this on 3.0.2 with {{row_number}}, but I > suspect it will impact others as well. This is a really rare corner case, but > because it is silent data corruption I personally think it is quite serious. > {code:scala} > import org.apache.spark.sql.expressions.Window > val windowSpec = Window.partitionBy("a").orderBy("b") > val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as > b") > spark.time(df.select(col("a"), col("b"), > row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), > desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20)) > +-----+----------+ > > | dir| count| > +-----+----------+ > |false|2147483647| > | true| 1| > +-----+----------+ > Time taken: 1139089 ms > Int.MaxValue.toLong + 100 > res15: Long = 2147483747 > 2147483647L + 1 > res16: Long = 2147483648 > {code} > I had to make sure that I ran the above with at least 64GiB of heap for the > executor (I did it in local mode and it worked, but took forever to run) -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org