[ https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368096#comment-17368096 ]

Robert Joseph Evans commented on SPARK-35563:
---------------------------------------------

Yes, technically if we switch it from an int to a long then we will have a 
similar problem with Long.MaxValue. But that kicks the can down the road a 
very long way. With the current Spark memory layout for UnsafeRow, where each 
row has a long for null tracking followed by a long for each column (possibly 
more), a single-column dataframe would need 32 exabytes of memory to hold this 
window before we hit the problem. But yes, we should look at adding an 
overflow check as well. I would just want to measure its performance impact so 
we can make an informed decision.
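
To make the "overflow check" idea concrete, here is a minimal sketch of the 
kind of guard I have in mind, assuming a hypothetical long row counter inside 
the window buffer (the names rowIndex and nextRowIndex are illustrative, not 
actual fields in Spark's code):

{code:scala}
// Sketch only: a per-partition row counter held as a Long, with an explicit
// overflow check. Math.addExact throws ArithmeticException on overflow
// instead of silently wrapping around.
var rowIndex: Long = 0L

def nextRowIndex(): Long = {
  try {
    rowIndex = Math.addExact(rowIndex, 1L)
  } catch {
    case _: ArithmeticException =>
      throw new IllegalStateException(
        s"Window partition exceeded ${Long.MaxValue} rows")
  }
  rowIndex
}
{code}

The performance question is then the cost of that extra check per row versus a 
plain rowIndex += 1, which is what I would want to benchmark before deciding.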

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-35563
>                 URL: https://issues.apache.org/jira/browse/SPARK-35563
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure 
> because it takes a long time to test. As part of doing corner-case validation 
> testing for spark rapids I found that if a window function has more than 
> {{Int.MaxValue + 1}} rows, the result is silently truncated to that many 
> rows. I have only tested this on 3.0.2 with {{row_number}}, but I suspect it 
> will impact others as well. This is a really rare corner case, but because it 
> is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")
> spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn"))
>   .orderBy(desc("a"), desc("b"))
>   .select((col("rn") < 0).alias("dir"))
>   .groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64 GiB of heap for the 
> executor (I ran it in local mode and it worked, but it took forever to run).
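>
> For reference, the single {{true}} row in the output above is consistent with 
> a 32-bit row counter wrapping around: in two's-complement arithmetic the 
> value after {{Int.MaxValue}} is {{Int.MinValue}}, so a row number held in an 
> int can silently go negative. A quick illustration of just that arithmetic 
> (not the actual Spark code path):
> {code:scala}
> // Int arithmetic wraps silently on overflow.
> Int.MaxValue + 1 == Int.MinValue
> res0: Boolean = true
> (Int.MaxValue.toLong + 1).toInt
> res1: Int = -2147483648
> {code}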


