[ https://issues.apache.org/jira/browse/FLINK-39951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-39951:
-----------------------------------
Labels: pull-request-available (was: )
> [Python] ArrayConstructor uses == for String comparison, silently truncating
> long array values to int
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-39951
> URL: https://issues.apache.org/jira/browse/FLINK-39951
> Project: Flink
> Issue Type: Bug
> Components: API / Python
> Reporter: Jubin Soni
> Priority: Major
> Labels: pull-request-available
>
> *Summary:*
> ArrayConstructor uses String reference equality ({{==}}) instead of value
> equality ({{.equals()}}) for the Python {{'l'}} typecode, causing incorrect
> deserialization of long arrays.
> *Description:*
> In {{ArrayConstructor.java}}, the typecode check uses reference equality
> ({{==}}) rather than value equality ({{.equals()}}) when checking for
> Python's {{'l'}} (long) typecode:
>
> {{if (args.length == 2 && args[0] == "l") {}}
> *File:*
> [https://github.com/apache/flink/blob/master/flink-python/src/main/java/org/apache/flink/api/common/python/pickle/ArrayConstructor.java]
> Line: 30
> Because {{args[0]}} is a deserialized {{String}} object at runtime, it is not
> guaranteed to be the same interned instance as the string literal
> {{"l"}}. As a result, the comparison evaluates to {{false}}, making
> the {{long[]}} handling path effectively unreachable.
> Consequently, arrays with typecode {{'l'}} fall through to
> {{super.construct()}}, resulting in incorrect deserialization behavior.
> *Steps to Reproduce:*
> # Create a Python array with typecode {{'l'}} containing values larger than
> {{Integer.MAX_VALUE}} (for example, {{3000000000}}).
> # Pass the array through Flink's Python-to-Java
> serialization/deserialization path.
> # Read the resulting values on the Java side.
> *Expected Result:*
> Values are preserved as 64-bit longs and deserialized correctly.
> *Actual Result:*
> The {{'l'}} typecode branch is never taken. The array is handled by
> {{super.construct()}} instead, and values outside the 32-bit integer range
> can come back truncated or otherwise corrupted, with no exception raised.
> *Impact:*
> This can lead to silent data corruption for Python arrays containing 64-bit
> integer values. Users may receive incorrect results without any exception or
> warning, particularly when values exceed the 32-bit integer range.
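To make "silent" concrete, here is a minimal standalone sketch of what narrowing a 64-bit value to int looks like in Java. It is not taken from the Flink code base and does not claim to reproduce the exact path inside super.construct(); the class and method names (NarrowingDemo, narrow, fitsInInt) are hypothetical. It uses the example value from the reproduction steps.

```java
public class NarrowingDemo {
    // Hypothetical helper: a narrowing cast keeps only the low 32 bits of the long.
    static int narrow(long value) {
        return (int) value;
    }

    // Hypothetical helper: true when the value survives a round trip through int.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long original = 3_000_000_000L; // example value from the reproduction steps
        System.out.println(fitsInInt(original)); // prints false
        System.out.println(narrow(original));    // prints -1294967296, no exception
    }
}
```

The cast completes without any error, which is why a consumer on the Java side would see a plausible-looking but wrong number rather than a failure.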
> *Proposed Fix:*
> Replace:
>
> {{if (args.length == 2 && args[0] == "l") {}}
> with:
>
> {{if (args.length == 2 && "l".equals(args[0])) {}}
> This correctly performs value-based string comparison and ensures the
> intended {{long[]}} deserialization path is executed.
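The difference between the two conditions can be seen in isolation with the sketch below. It is an illustration only: TypecodeCheckDemo and its two methods are hypothetical names, and new String("l") merely stands in for a typecode string produced at runtime by deserialization, which is an assumption about how args[0] is created.

```java
public class TypecodeCheckDemo {
    // Mirrors the condition quoted in the report: reference comparison with a literal.
    static boolean matchesByReference(Object[] args) {
        return args.length == 2 && args[0] == "l";
    }

    // Mirrors the proposed fix: value comparison. Calling equals on the literal
    // also avoids a NullPointerException when args[0] is null.
    static boolean matchesByValue(Object[] args) {
        return args.length == 2 && "l".equals(args[0]);
    }

    public static void main(String[] args) {
        // new String(...) always creates a distinct object, unlike the interned literal.
        Object[] deserialized = {new String("l"), new byte[0]};
        System.out.println(matchesByReference(deserialized)); // prints false
        System.out.println(matchesByValue(deserialized));     // prints true
    }
}
```

Writing the literal on the left-hand side of equals() keeps the check safe for null or non-String elements, since the call then simply returns false.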