Hi dev,

I would like to hear your thoughts on deprecating Trigger.Once and replacing it with Trigger.AvailableNow [1] in Structured Streaming.

Rationale:

The expected behavior of Trigger.Once is to read all data that has become available since the last trigger and process it. This holds when the last run terminated gracefully, but streaming queries are not always terminated gracefully. The last run may have written the offset (WAL) for a new batch before it terminated. In that case, a new run of Trigger.Once only processes the data that was planned in that latest unfinished batch, and does not process any newer data.

From the users' point of view this behavior is not deterministic, because end users cannot know whether the last run wrote the offset unless they inspect the query's checkpoint themselves.

Trigger.AvailableNow was introduced to solve the scalability issue of Trigger.Once, but it also ensures that the query tries to process all data available at the point in time it is triggered. That consistently matches the expected behavior of Trigger.Once.

Proposed plan:

- Deprecate Trigger.Once in Apache Spark 3.3
- Add guidance on migrating to Trigger.AvailableNow to the migration guide
- Replace all usages of Trigger.Once with Trigger.AvailableNow, except in the test cases for Trigger.Once itself

Please review the proposal and share your thoughts.

Thanks!
Jungtaek Lim

1. https://issues.apache.org/jira/browse/SPARK-36533
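
For anyone who wants the non-determinism described in the rationale in concrete terms, here is a toy model of the two triggers. It is only a sketch under simplifying assumptions, not Spark's actual code: `ToyCheckpoint`, `run_once`, and `run_available_now` are hypothetical names, offsets are plain integers, and the "WAL" is reduced to a single planned end offset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCheckpoint:
    """Toy stand-in for a query checkpoint (hypothetical, not Spark's format)."""
    committed_end: int = 0              # end offset of the last committed batch
    planned_end: Optional[int] = None   # offset logged for a batch that never committed


def run_once(cp: ToyCheckpoint, available_end: int) -> List[int]:
    """Model of Trigger.Once: run exactly one batch, then stop."""
    # If the previous run logged a batch but died before committing it,
    # that batch is re-run as planned and newer data is left untouched.
    end = cp.planned_end if cp.planned_end is not None else available_end
    processed = list(range(cp.committed_end, end))
    cp.committed_end, cp.planned_end = end, None
    return processed


def run_available_now(cp: ToyCheckpoint, available_end: int) -> List[int]:
    """Model of Trigger.AvailableNow: keep going until the data that was
    available at trigger time has been processed."""
    processed: List[int] = []
    if cp.planned_end is not None:
        # Finish the unfinished batch first.
        processed.extend(range(cp.committed_end, cp.planned_end))
        cp.committed_end, cp.planned_end = cp.planned_end, None
    # Then process everything else that was available when triggered.
    processed.extend(range(cp.committed_end, available_end))
    cp.committed_end = available_end
    return processed


if __name__ == "__main__":
    # Offsets 0..3 are committed, 10 records exist in total.
    print(run_once(ToyCheckpoint(4), 10))              # graceful last run
    print(run_once(ToyCheckpoint(4, 7), 10))           # last run logged a batch up to 7, then died
    print(run_available_now(ToyCheckpoint(4, 7), 10))  # same checkpoint, AvailableNow
```

In this model the same input gives Trigger.Once two different results depending on hidden checkpoint state (offsets 4 to 9 after a graceful stop, only 4 to 6 after an ungraceful one), while Trigger.AvailableNow processes 4 to 9 in both cases.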