Jason Teoh created SPARK-53870:
----------------------------------
Summary: Python streaming transform_with_state StateServer does
not fully read large state values
Key: SPARK-53870
URL: https://issues.apache.org/jira/browse/SPARK-53870
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 4.0.1, 4.0.0, 4.1.0
Reporter: Jason Teoh
The TransformWithState StateServer's {{parseProtoMessage}} method uses {{read}}
(InputStream/FilterInputStream), which returns as soon as some data is available
and may copy fewer bytes than the buffer can hold, so it is not guaranteed to
return the full message. We should use the [readFully DataInputStream
API|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)]
instead, which keeps reading until the provided buffer is full.
In addition to the linked API above, this StackOverflow post also illustrates
the difference between the two APIs: [https://stackoverflow.com/a/25900095]
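To make the difference concrete, here is a minimal, self-contained sketch. It is not the StateServer code: {{ChunkedInputStream}}, {{plainRead}}, {{readFully}} (the static helper) and the 4096-byte chunk size are hypothetical names and values chosen only to mimic a socket that hands back data in pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ReadVsReadFully {

    /** Hypothetical stream that returns at most maxChunk bytes per bulk read, like a socket might. */
    static final class ChunkedInputStream extends FilterInputStream {
        private final int maxChunk;

        ChunkedInputStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    static DataInputStream open(byte[] payload, int maxChunk) {
        return new DataInputStream(
            new ChunkedInputStream(new ByteArrayInputStream(payload), maxChunk));
    }

    /** One call to read(byte[]): returns how many bytes were actually copied. */
    public static int plainRead(byte[] payload, int maxChunk) {
        try (DataInputStream in = open(payload, maxChunk)) {
            return in.read(new byte[payload.length]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** readFully(byte[]): loops internally until the buffer is full or the stream ends. */
    public static byte[] readFully(byte[] payload, int maxChunk) {
        byte[] buf = new byte[payload.length];
        try (DataInputStream in = open(payload, maxChunk)) {
            in.readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[10_000];
        Arrays.fill(payload, (byte) 7);

        // prints: read copied 4096 of 10000 bytes
        System.out.println("read copied " + plainRead(payload, 4096)
            + " of " + payload.length + " bytes");

        // prints: readFully complete: true
        System.out.println("readFully complete: "
            + Arrays.equals(payload, readFully(payload, 4096)));
    }
}
```

With a small payload both calls behave the same, which is presumably why the problem only shows up for large messages.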
Without this change, it is possible for the state server to fail to fully read
large proto messages (e.g., those containing a large state value update) and
run into a parsing error.
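The shape of the fix can be sketched as follows. This assumes, purely for illustration, that each message on the wire is a 4-byte big-endian length followed by that many payload bytes; the real StateServer framing may differ. {{FramedReader}}, {{readFrame}}, {{writeFrame}} and {{roundTrip}} are hypothetical names, and no protobuf parsing is performed here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class FramedReader {

    /** Reads one frame: an assumed 4-byte length prefix, then exactly that many bytes. */
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload); // throws EOFException if the stream ends before the buffer is full
        return payload;
    }

    /** Writes one frame in the same assumed layout. */
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Sends two frames through a stream that yields at most maxChunk bytes per bulk read. */
    public static byte[][] roundTrip(byte[] first, byte[] second, int maxChunk) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            writeFrame(out, first);
            writeFrame(out, second);
            out.flush();

            DataInputStream in = new DataInputStream(
                new FilterInputStream(new ByteArrayInputStream(wire.toByteArray())) {
                    @Override
                    public int read(byte[] b, int off, int len) throws IOException {
                        // Simulate a transport that delivers data in pieces.
                        return super.read(b, off, Math.min(len, maxChunk));
                    }
                });
            return new byte[][] { readFrame(in), readFrame(in) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] largeValue = new byte[50_000]; // stands in for a large state value update
        byte[] nextRequest = new byte[] {1, 2, 3};
        byte[][] frames = roundTrip(largeValue, nextRequest, 1024);
        System.out.println("first frame: " + frames[0].length + " bytes, second frame: "
            + frames[1].length + " bytes");
    }
}
```

If {{readFrame}} used a single {{read}} call instead, the first buffer would be only partly filled and the unread tail would stay in the stream, so the next length prefix would be decoded from the middle of the previous payload. Under these assumptions that is one plausible way to end up with the parsing error described above.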
Affected versions identified by the tags on the original PR:
[https://github.com/apache/spark/commit/def42d44405af5df78c3039ac5ad0f8a0469efaa]