[ 
https://issues.apache.org/jira/browse/SPARK-49699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-49699:
---------------------------------
    Description: 
PruneFilters replaces the {{null}} / {{false}} filter with an empty relation, 
which means the subtree of the filter is also lost. The optimization does not 
care about whichever operator is in the subtree, hence some important operators 
like stateful operator, watermark node, observe node could be lost.

The filter could be evaluated to {{null}} / {{false}} selectively among 
microbatches in various reasons (one simple example is the modification of the 
query during restart), which means stateful operator might not be available for 
batch N and be available for batch N + 1. For this case, streaming query will 
fail as batch N + 1 cannot load the state from batch N, and it's not 
recoverable in most cases.

We have to disable the rule for streaming workloads, with the consideration of 
backward compatibility - we should avoid breaking existing query.

  was:Various optimizer rules are unsuitable for the streaming or side-effect 
settings. Disable them, and roll out the disablement with care as to not break 
existing queries.


> Disable PruneFilters for streaming workloads
> --------------------------------------------
>
>                 Key: SPARK-49699
>                 URL: https://issues.apache.org/jira/browse/SPARK-49699
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Nick Young
>            Assignee: Nick Young
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0, 3.5.3
>
>
> PruneFilters replaces the {{null}} / {{false}} filter with an empty relation, 
> which means the subtree of the filter is also lost. The optimization does not 
> care about whichever operator is in the subtree, hence some important 
> operators like stateful operator, watermark node, observe node could be lost.
> The filter could be evaluated to {{null}} / {{false}} selectively among 
> microbatches in various reasons (one simple example is the modification of 
> the query during restart), which means stateful operator might not be 
> available for batch N and be available for batch N + 1. For this case, 
> streaming query will fail as batch N + 1 cannot load the state from batch N, 
> and it's not recoverable in most cases.
> We have to disable the rule for streaming workloads, with the consideration 
> of backward compatibility - we should avoid breaking existing query.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to