[ 
https://issues.apache.org/jira/browse/SPARK-53870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Teoh updated SPARK-53870:
-------------------------------
    Description: 
The TransformWithState StateServer's {{parseProtoMessage}} method uses {{read}} 
(InputStream/FilterInputStream), which may return fewer bytes than requested and 
therefore may not return the full message. We should use the [readFully DataInputStream 
API|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)]
 instead, which keeps reading until the provided buffer is full.
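
The difference can be reproduced outside Spark with a small Java sketch. The {{ChunkedStream}} class and the 4096-byte chunk size below are hypothetical stand-ins for a socket that delivers data in pieces; they are not taken from the Spark code base.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ReadVsReadFully {

    /** Hypothetical stream that hands out at most `chunk` bytes per read call. */
    static final class ChunkedStream extends FilterInputStream {
        private final int chunk;

        ChunkedStream(InputStream in, int chunk) {
            super(in);
            this.chunk = chunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Cap every read at `chunk` bytes, as a socket might under load.
            return super.read(b, off, Math.min(len, chunk));
        }
    }

    /** Builds a message of `size` bytes, all set to 1 so unread bytes (0) stand out. */
    static byte[] message(int size) {
        byte[] m = new byte[size];
        Arrays.fill(m, (byte) 1);
        return m;
    }

    static DataInputStream open(byte[] message, int chunk) {
        return new DataInputStream(
            new ChunkedStream(new ByteArrayInputStream(message), chunk));
    }

    /** A single read() call: returns how many bytes actually landed in the buffer. */
    static int readOnce(byte[] message, int chunk) {
        byte[] buffer = new byte[message.length];
        try (DataInputStream in = open(message, chunk)) {
            return in.read(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** readFully() loops internally until the buffer is full. */
    static boolean readFullyMatches(byte[] message, int chunk) {
        byte[] buffer = new byte[message.length];
        try (DataInputStream in = open(message, chunk)) {
            in.readFully(buffer);
            return Arrays.equals(buffer, message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] msg = message(512 * 1024);
        System.out.println("read returned " + readOnce(msg, 4096)
            + " of " + msg.length + " bytes");
        System.out.println("readFully filled the buffer: "
            + readFullyMatches(msg, 4096));
    }
}
```

With a single {{read}} call the tail of the buffer stays zero-filled, which is the kind of partial data a proto parser would then reject.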

This StackOverflow post also illustrates the difference between the two APIs: 
[https://stackoverflow.com/a/25900095]

Without this change, it is possible for the state server to fail to fully read 
large proto messages (e.g., those containing a large state value update) and 
run into a parsing error.
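
As a rough illustration of why {{readFully}} is the safer choice, the sketch below reads a length-prefixed message. The framing (a 4-byte length followed by the payload) and the names {{frame}}, {{readMessage}} and {{tryRead}} are assumptions made for this example, not a description of the actual StateServer wire format. The point it shows is that {{readFully}} either fills the buffer or throws {{EOFException}}, so an incomplete message surfaces as an I/O error instead of being handed to the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;

public class FramedMessageReader {

    /** Hypothetical framing: 4-byte big-endian length, then the payload. */
    static byte[] frame(byte[] payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads one framed message; readFully blocks until all bytes arrive. */
    static byte[] readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload); // throws EOFException if the stream ends early
        return payload;
    }

    /** Returns the payload, or empty if the stream ended before the message was complete. */
    static Optional<byte[]> tryRead(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return Optional.of(readMessage(in));
        } catch (EOFException e) {
            return Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = frame("state value".getBytes(StandardCharsets.UTF_8));
        System.out.println("complete message read: " + tryRead(wire).isPresent());

        // Drop the last 3 bytes to simulate a message that never fully arrives.
        byte[] truncated = Arrays.copyOf(wire, wire.length - 3);
        System.out.println("truncated message read: " + tryRead(truncated).isPresent());
    }
}
```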

 

Affected versions were identified from the tags on the original PR; the bug seems 
to have been present since the state server was introduced: 
[https://github.com/apache/spark/commit/def42d44405af5df78c3039ac5ad0f8a0469efaa]

 

In practice this seems to be an uncommon scenario: the bug was identified and 
confirmed with a 512KB string state value update, which likely produces a proto 
message much larger than those in typical use cases.

 

  was:
The TransformWithState StateServer's {{parseProtoMessage}} method uses {{read}} 
(InputStream/FilterInputStream) which only reads all available data and may not 
return the full message. We should be using the [readFully DataInputStream 
API|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)]
 instead, which will continue fetching until it fills up the provided buffer.

In addition to the linked API above, this StackOverflow post also illustrates 
the difference between the two APIs: [https://stackoverflow.com/a/25900095]

Without this change, it is possible for the state server to fail to fully read 
large proto messages (e.g., those containing a large state value update) and 
run into a parsing error.

 

Affected versions identified by the tags on the original PR: 
[https://github.com/apache/spark/commit/def42d44405af5df78c3039ac5ad0f8a0469efaa]

 


> Python streaming transform_with_state StateServer does not fully read large 
> state values
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-53870
>                 URL: https://issues.apache.org/jira/browse/SPARK-53870
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.1.0, 4.0.0, 4.0.1
>            Reporter: Jason Teoh
>            Priority: Major


