Frans King created FLINK-27934:
----------------------------------
Summary: Python API- Inefficient deserialization/serialization of
state variables within a batch
Key: FLINK-27934
URL: https://issues.apache.org/jira/browse/FLINK-27934
Project: Flink
Issue Type: Improvement
Components: Stateful Functions
Affects Versions: statefun-3.2.0
Reporter: Frans King
In the Python API, state variables can be accessed via the UserFacingContext:
variable = context.storage.variable
This calls into the Cell instance for that state variable, which has get() and
set() methods. The get() method always deserializes from the typed_value, and
the set() method always re-serializes and marks the cell dirty.
This has two side effects:
1:
var1 = context.storage.variable
var2 = context.storage.variable
var2 is not var1: each access returns a different instance
2:
In a large batch (say 1000 calls to the same function type and id), this can
result in deserializing and re-serializing the same state variable 1000
times, when it only needs to be deserialized in the first invocation in
the batch, held in memory until the last invocation, and then re-serialized
once prior to collecting the mutations.
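Both effects can be reproduced with a simplified stand-in for the current behavior. This is only a sketch: EagerCell, its call counters, and the use of JSON as the serializer are assumptions made for illustration, not the SDK's actual Cell, which works on typed_value with the type's own serializer.

```python
import json


class EagerCell:
    """Hypothetical stand-in for a cell that round-trips on every access."""

    def __init__(self, raw: bytes):
        self.raw = raw  # plays the role of typed_value
        self.dirty = False
        self.deserialize_calls = 0  # illustration only
        self.serialize_calls = 0    # illustration only

    def get(self):
        # Always deserializes, so every call builds a fresh object.
        self.deserialize_calls += 1
        return json.loads(self.raw)

    def set(self, value):
        # Always re-serializes and marks the cell dirty.
        self.serialize_calls += 1
        self.raw = json.dumps(value).encode("utf-8")
        self.dirty = True


cell = EagerCell(b'{"count": 0}')

# Effect 1: two reads return equal but distinct objects.
var1 = cell.get()
var2 = cell.get()
distinct_instances = var1 is not var2

# Effect 2: a batch of 1000 invocations pays for 1000 round trips.
batch = EagerCell(b'{"count": 0}')
for _ in range(1000):
    state = batch.get()
    state["count"] += 1
    batch.set(state)
```

In this sketch the counters on batch end up at 1000 deserializations and 1000 serializations, even though only the first read and the last write are needed.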
I think this can be improved by having a lazily initialized backing field in
the Cell class, but I don't know whether the behavior described in 1 was a
conscious design decision.
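One possible shape for the lazily initialized backing field is sketched below. The names LazyCell, _UNSET and flush(), the counters, and the JSON serializer are all hypothetical; where the real SDK would trigger the final serialization (before collecting mutations) is an assumption here.

```python
import json

_UNSET = object()  # sentinel: nothing has been deserialized yet


class LazyCell:
    """Hypothetical cell that deserializes once and serializes on flush."""

    def __init__(self, raw: bytes):
        self.raw = raw  # plays the role of typed_value
        self.dirty = False
        self._value = _UNSET  # lazily initialized backing field
        self.deserialize_calls = 0  # illustration only
        self.serialize_calls = 0    # illustration only

    def get(self):
        # Deserialize only on first access; later reads reuse the object.
        if self._value is _UNSET:
            self.deserialize_calls += 1
            self._value = json.loads(self.raw)
        return self._value

    def set(self, value):
        # Keep the object in memory and defer serialization.
        self._value = value
        self.dirty = True

    def flush(self) -> bytes:
        # Serialize once, e.g. just before mutations are collected.
        if self.dirty:
            self.serialize_calls += 1
            self.raw = json.dumps(self._value).encode("utf-8")
        return self.raw


cell = LazyCell(b'{"count": 0}')
for _ in range(1000):
    state = cell.get()
    state["count"] += 1
    cell.set(state)
final_raw = cell.flush()
```

With this approach, repeated reads return the same instance, which changes the behavior described in 1. It also means an in-place mutation without a following set() would be visible to later reads but would not mark the cell dirty, so the trade-off would need to be agreed on.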
Any feedback would be welcome.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)