[ 
https://issues.apache.org/jira/browse/SPARK-53870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-53870:
---------------------------------
    Fix Version/s: 4.0.2

> Python streaming transform_with_state StateServer does not fully read large 
> state values
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-53870
>                 URL: https://issues.apache.org/jira/browse/SPARK-53870
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.1.0, 4.0.0, 4.0.1
>            Reporter: Jason Teoh
>            Assignee: Jason Teoh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0, 4.0.2
>
>
> The TransformWithState StateServer's {{parseProtoMessage}} method uses 
> {{read}} (InputStream/FilterInputStream), which returns only the data that 
> is currently available and may not return the full message. It should use 
> the [readFully DataInputStream 
> API|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)]
> instead, which keeps reading until it fills the provided buffer.
> In addition to the linked API above, this StackOverflow post also illustrates 
> the difference between the two APIs: [https://stackoverflow.com/a/25900095]
> Without this change, it is possible for the state server to fail to fully 
> read large proto messages (e.g., those containing a large state value update) 
> and run into a parsing error.
>  
> Affected versions were identified from the tags on the original PR; the bug 
> appears to have been present since the state server was introduced: 
> [https://github.com/apache/spark/commit/def42d44405af5df78c3039ac5ad0f8a0469efaa]
>  
> In practice this seems to be an uncommon scenario (the bug was identified 
> and confirmed with a 512KB string state value update, which likely produces 
> a proto message much larger than those in typical use cases).
>  
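The difference between {{read}} and {{readFully}} described in the issue can be reproduced outside Spark with a minimal sketch. This is not the StateServer code: the class ReadFullyDemo, its helper methods, and ChunkedInputStream are hypothetical names, and the chunking wrapper merely stands in for a socket that delivers a large message in pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ReadFullyDemo {

    /**
     * Hypothetical stand-in for a socket stream: each read call hands out
     * at most maxChunk bytes, even when the caller asks for more.
     */
    static final class ChunkedInputStream extends FilterInputStream {
        private final int maxChunk;

        ChunkedInputStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    private static DataInputStream open(byte[] payload, int maxChunk) {
        return new DataInputStream(
                new ChunkedInputStream(new ByteArrayInputStream(payload), maxChunk));
    }

    /** One read call: returns how many bytes actually landed in the buffer. */
    static int singleRead(byte[] payload, int maxChunk) throws IOException {
        try (DataInputStream in = open(payload, maxChunk)) {
            byte[] buf = new byte[payload.length];
            return in.read(buf);
        }
    }

    /** readFully: loops internally until the whole buffer is filled (or throws EOFException). */
    static byte[] fullRead(byte[] payload, int maxChunk) throws IOException {
        try (DataInputStream in = open(payload, maxChunk)) {
            byte[] buf = new byte[payload.length];
            in.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // 512KB payload, matching the size mentioned in the report.
        byte[] payload = new byte[512 * 1024];
        Arrays.fill(payload, (byte) 'x');

        int n = singleRead(payload, 8192);
        byte[] full = fullRead(payload, 8192);

        System.out.println("read filled " + n + " of " + payload.length + " bytes");
        System.out.println("readFully matches payload: " + Arrays.equals(full, payload));
    }
}
```

In this sketch a single {{read}} fills only the first 8192 bytes of the buffer, so a parser handed that buffer would see a truncated message, while {{readFully}} returns the complete payload.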



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
