Jason Teoh created SPARK-53870:
----------------------------------

             Summary: Python streaming transform_with_state StateServer does 
not fully read large state values
                 Key: SPARK-53870
                 URL: https://issues.apache.org/jira/browse/SPARK-53870
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.1, 4.0.0, 4.1.0
            Reporter: Jason Teoh


The TransformWithState StateServer's {{parseProtoMessage}} method uses {{read}} 
(InputStream/FilterInputStream), which reads only the data that is currently 
available and may return before the full message has arrived. We should use the 
[readFully DataInputStream 
API|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)]
 instead, which keeps reading until it fills the provided buffer.

In addition to the linked API above, this StackOverflow post also illustrates 
the difference between the two APIs: [https://stackoverflow.com/a/25900095]
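
The difference can be reproduced without Spark. The following is a minimal sketch, not the StateServer code: {{ShortReadDemo}}, {{ChunkedInputStream}}, {{readOnce}} and {{readAll}} are hypothetical names, and the chunked stream only imitates a socket that delivers a large message in several pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ShortReadDemo {

    // Hypothetical stream that hands back at most `chunk` bytes per call,
    // imitating a socket whose data arrives in pieces.
    static final class ChunkedInputStream extends FilterInputStream {
        private final int chunk;

        ChunkedInputStream(InputStream in, int chunk) {
            super(in);
            this.chunk = chunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, chunk));
        }
    }

    private static DataInputStream open(byte[] payload, int chunk) {
        return new DataInputStream(
            new ChunkedInputStream(new ByteArrayInputStream(payload), chunk));
    }

    // A single read(): returns how many bytes were actually placed in the buffer.
    static int readOnce(byte[] payload, int chunk) throws IOException {
        byte[] buf = new byte[payload.length];
        return open(payload, chunk).read(buf);
    }

    // readFully(): loops internally until the buffer is full,
    // or throws EOFException if the stream ends first.
    static byte[] readAll(byte[] payload, int chunk) throws IOException {
        byte[] buf = new byte[payload.length];
        open(payload, chunk).readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[10_000];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) i;
        }
        System.out.println("read() returned " + readOnce(payload, 4096)
            + " of " + payload.length + " bytes");
        System.out.println("readFully() matched payload: "
            + Arrays.equals(payload, readAll(payload, 4096)));
    }
}
```

With a 10,000-byte payload and a 4,096-byte chunk, the single {{read}} call fills only part of the buffer, so a parser given that buffer would see a truncated message, while {{readFully}} returns the complete payload.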

Without this change, it is possible for the state server to fail to fully read 
large proto messages (e.g., those containing a large state value update) and 
run into a parsing error.

Affected versions identified by the tags on the original PR: 
[https://github.com/apache/spark/commit/def42d44405af5df78c3039ac5ad0f8a0469efaa]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

