Jungtaek Lim created SPARK-57003:
------------------------------------
Summary: Enforce streaming stateful operator output and state-schema nullability
Key: SPARK-57003
URL: https://issues.apache.org/jira/browse/SPARK-57003
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 4.3.0
Reporter: Jungtaek Lim
This has been a long-standing conflict between the streaming engine and the
Query Optimizer. A streaming query is by nature long-running, and in many
cases it spans multiple Spark versions. Also, the logical plan is not always
the same across batches (e.g. there are multiple stream sources and one of
them has no new data at batch N). As a result, a streaming query is exposed to
changes in how the analyzer and optimizer shape the plan.
The state schema of a stateful operator is mostly determined by the operator's
input schema, and nullability is no exception. If the input schema has a
nullable column, the state schema has a nullable column; likewise, a
non-nullable input column yields a non-nullable state column.
One of the Query Optimizer's optimizations is to flip nullability, say, from
nullable to non-nullable where appropriate. (This can happen directly or
indirectly; one indirect example is the nullability of a column being
re-derived when a union is eliminated down to one side.) If this optimization
applies only conditionally, the nullability of the column can change over
time, which breaks the stateful operator.
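
To make the failure mode concrete, here is a rough sketch in plain Python
(not Spark code). Field, union_output, state_schema_for and can_write are
hypothetical names invented for this illustration, and the compatibility rule
is an assumption, not the actual checker in Spark. It models the union case:
the state schema is first written while only one side of the union survives,
and a later batch sees the same column as nullable.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema column; not Spark's StructField.
    name: str
    dtype: str
    nullable: bool


Schema = Tuple[Field, ...]


def union_output(left: Schema, right: Schema) -> Schema:
    # A union column is nullable if it is nullable on either side.
    return tuple(
        Field(l.name, l.dtype, l.nullable or r.nullable)
        for l, r in zip(left, right)
    )


def state_schema_for(input_schema: Schema) -> Schema:
    # Assumption for illustration: the state schema mirrors the input
    # schema, nullability included.
    return input_schema


def can_write(stored: Schema, current: Schema) -> bool:
    # Hypothetical rule: names and types must match, and a nullable column
    # must not be written into a non-nullable state column.
    if len(stored) != len(current):
        return False
    return all(
        s.name == c.name and s.dtype == c.dtype and (s.nullable or not c.nullable)
        for s, c in zip(stored, current)
    )


source_a = (Field("key", "string", False),)
source_b = (Field("key", "string", True),)

# First batch: source_b has no data, so assume the optimizer eliminates the
# union and only source_a's (non-nullable) schema reaches the operator.
stored = state_schema_for(source_a)

# Later batch: both sources have data, the union is kept, and the same
# column is now nullable.
current = state_schema_for(union_output(source_a, source_b))

schema_conflict = not can_write(stored, current)
```

Under these assumptions the later batch no longer fits the state schema that
the first batch wrote, even though the user's query never changed.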
The holistic fix is to loosen the nullability of the state schema so that it
is always nullable, and to propagate this to the output schema of the stateful
operator, which should in turn be propagated to the downstream operators.
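
A minimal sketch of that idea in plain Python (not Spark code) might look as
follows. Field, as_nullable and stateful_operator_schemas are hypothetical
names used only for illustration, and treating the output schema as a copy of
the input columns is a simplification. The point is that two batches whose
inputs differ only in nullability end up with identical state and output
schemas.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema column; not Spark's StructField.
    name: str
    dtype: str
    nullable: bool


Schema = Tuple[Field, ...]


def as_nullable(schema: Schema) -> Schema:
    # Force every column to nullable, whatever the plan reported.
    return tuple(replace(f, nullable=True) for f in schema)


def stateful_operator_schemas(input_schema: Schema) -> Tuple[Schema, Schema]:
    # Assumption: state and output both mirror the input columns here.
    state = as_nullable(input_schema)
    # Downstream operators see the relaxed nullability as well.
    output = as_nullable(input_schema)
    return state, output


# Two batches of the same query whose inputs differ only in nullability.
batch_a_input = (Field("key", "string", False), Field("count", "long", False))
batch_b_input = (Field("key", "string", True), Field("count", "long", False))

state_a, output_a = stateful_operator_schemas(batch_a_input)
state_b, output_b = stateful_operator_schemas(batch_b_input)

schemas_stable = state_a == state_b and output_a == output_b
```

The trade-off in this sketch is that the operator gives up any
nullability-based optimization on these columns in exchange for a schema that
does not depend on what the optimizer decided for a particular batch.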