Jubin Soni created FLINK-39951:
----------------------------------

             Summary: [Python] ArrayConstructor uses == for String comparison, 
silently truncating long array values to int
                 Key: FLINK-39951
                 URL: https://issues.apache.org/jira/browse/FLINK-39951
             Project: Flink
          Issue Type: Bug
          Components: API / Python
            Reporter: Jubin Soni


*Summary:*
ArrayConstructor uses String reference equality ({{==}}) instead of value 
equality ({{.equals()}}) when checking for the Python {{'l'}} typecode, 
causing incorrect deserialization of long arrays.

*Description:*
In {{ArrayConstructor.java}}, the typecode check uses reference equality 
({{==}}) rather than value equality ({{.equals()}}) when checking for 
Python's {{'l'}} (long) typecode:

 

{{if (args.length == 2 && args[0] == "l") {}}

*File:*
[https://github.com/apache/flink/blob/master/flink-python/src/main/java/org/apache/flink/api/common/python/pickle/ArrayConstructor.java]

Line: 30

Because {{args[0]}} is a {{String}} object created during deserialization at 
runtime, it is not guaranteed to be the same interned instance as the string 
literal {{"l"}}. As a result, the comparison normally evaluates to {{false}}, 
making the {{long[]}} handling path effectively unreachable.
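
The difference can be shown in isolation. The following is a minimal sketch, not the Flink code: the class and method names are hypothetical, and {{new String(...)}} merely stands in for a typecode string produced at runtime by the unpickler.

```java
class StringEqualityDemo {
    // Mirrors the current check: compares object identity, not contents.
    static boolean referenceEquals(Object typecode) {
        return typecode == "l";
    }

    // Mirrors the proposed check: compares contents and is null-safe.
    static boolean valueEquals(Object typecode) {
        return "l".equals(typecode);
    }

    public static void main(String[] args) {
        // A String built at runtime is a distinct object from the literal "l".
        Object runtimeTypecode = new String(new char[] {'l'});

        System.out.println(referenceEquals(runtimeTypecode)); // false
        System.out.println(valueEquals(runtimeTypecode));     // true
    }
}
```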

Consequently, arrays with typecode {{'l'}} fall through to 
{{super.construct()}} and bypass the {{long[]}} conversion, so they are not 
deserialized as 64-bit values.

*Steps to Reproduce:*
 # Create a Python array with typecode {{'l'}} containing values larger than 
{{Integer.MAX_VALUE}} (for example, {{3000000000}}).

 # Pass the array through Flink's Python-to-Java serialization/deserialization 
path.

 # Read the resulting values on the Java side.

*Expected Result:*
Values are preserved as 64-bit longs and deserialized correctly.

*Actual Result:*
The {{'l'}} typecode branch is never taken, so values outside the 32-bit 
integer range may be silently truncated or otherwise corrupted.
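
To illustrate why a value such as {{3000000000}} exposes the problem, the sketch below shows what narrowing a 64-bit value to a 32-bit {{int}} does in Java. It assumes the fall-through path ends up treating the element as an {{int}}, as the issue title states; the exact code path inside {{super.construct()}} is not reproduced here, and the class name is hypothetical.

```java
class NarrowingDemo {
    // Narrowing keeps only the low 32 bits of the value.
    static int narrowToInt(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        long original = 3_000_000_000L; // larger than Integer.MAX_VALUE
        // No exception is thrown; the value silently wraps around.
        System.out.println(original + " -> " + narrowToInt(original));
        // prints "3000000000 -> -1294967296"
    }
}
```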

*Impact:*
This can lead to silent data corruption for Python arrays containing 64-bit 
integer values. Users may receive incorrect results without any exception or 
warning, particularly when values exceed the 32-bit integer range.

*Proposed Fix:*

Replace:

 

{{if (args.length == 2 && args[0] == "l") {}}

with:

 

{{if (args.length == 2 && "l".equals(args[0])) {}}

This correctly performs value-based string comparison and ensures the intended 
{{long[]}} deserialization path is executed.
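
As a rough sketch of the intended behavior after the fix, the following standalone class applies the corrected check and builds a {{long[]}}. It is not the actual Flink implementation: the class name is hypothetical, the element conversion via {{Number.longValue()}} is an assumption, and the real class would delegate to {{super.construct(args)}} instead of returning {{null}}.

```java
import java.util.Arrays;
import java.util.List;

class LongArrayPathSketch {
    // Hypothetical stand-in for the 'l' branch of ArrayConstructor.construct().
    static Object construct(Object[] args) {
        if (args.length == 2 && "l".equals(args[0])) {
            List<?> values = (List<?>) args[1];
            long[] result = new long[values.size()];
            for (int i = 0; i < values.size(); i++) {
                // Assumption: each element arrives as some java.lang.Number.
                result[i] = ((Number) values.get(i)).longValue();
            }
            return result;
        }
        // Placeholder only: the real class would call super.construct(args).
        return null;
    }

    public static void main(String[] args) {
        // Runtime-created typecode string, as in the deserialization path.
        Object typecode = new String(new char[] {'l'});
        Object out = construct(new Object[] {typecode, Arrays.asList(3_000_000_000L, 1)});
        System.out.println(Arrays.toString((long[]) out)); // [3000000000, 1]
    }
}
```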



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
