Deepayan Patra created SPARK-43392:
--------------------------------------

             Summary: Sequence expression can overflow
                 Key: SPARK-43392
                 URL: https://issues.apache.org/jira/browse/SPARK-43392
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Deepayan Patra


Spark has a (long-standing) overflow bug in the {{sequence}} expression.

 

Consider the following operations:

{{spark.sql("CREATE TABLE foo (l LONG);")}}
{{spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")}}
{{spark.sql("SELECT sequence(0, l) FROM foo;").collect()}}

 

The result of these operations is:

{{Array[org.apache.spark.sql.Row] = Array([WrappedArray()])}}

an empty array, an unintended consequence of overflow.

 

The sequence is applied to {{start = 0}} and {{stop = Long.MaxValue}} with a step size of {{step = 1}}, using the length computation defined [here|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451]. With these inputs, the calculated {{len}} overflows to {{Long.MinValue}}. In binary, the computation looks like:

{noformat}
  0111111111111111111111111111111111111111111111111111111111111111   (stop = Long.MaxValue)
- 0000000000000000000000000000000000000000000000000000000000000000   (start = 0)
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
/ 0000000000000000000000000000000000000000000000000000000000000001   (step = 1)
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
+ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  1000000000000000000000000000000000000000000000000000000000000000   (Long.MinValue)
{noformat}

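The same wraparound can be reproduced with plain JVM arithmetic. This is a minimal sketch of the length computation described above, not Spark's actual code:

```java
public class SequenceOverflowDemo {
    public static void main(String[] args) {
        long start = 0L;
        long stop = Long.MAX_VALUE;
        long step = 1L;

        // (stop - start) / step + 1: the final +1 pushes Long.MAX_VALUE
        // past the representable range, wrapping to Long.MIN_VALUE.
        long len = (stop - start) / step + 1;

        System.out.println(len == Long.MIN_VALUE); // prints "true"
    }
}
```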

The subsequent [check|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454] passes because the negative {{Long.MinValue}} is still {{<= MAX_ROUNDED_ARRAY_LENGTH}}. The following {{toInt}} cast then [truncates the upper bits|https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457], yielding a length of 0 and hence the empty array.
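The truncation step can likewise be demonstrated in isolation with JVM narrowing conversion (a sketch of the cast's effect, not Spark's code):

```java
public class TruncationDemo {
    public static void main(String[] args) {
        // Long.MIN_VALUE in binary is a 1 followed by 63 zeros.
        long len = Long.MIN_VALUE;

        // Narrowing to int keeps only the low 32 bits, which are all zero,
        // so the computed array length silently becomes 0.
        int truncated = (int) len;

        System.out.println(truncated); // prints "0"
    }
}
```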

Other overflows are similarly problematic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
