Friendly reminder: I'll submit the proposed change if no objections are raised this week.

On Wed, Dec 8, 2021 at 4:16 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Hi dev,
>
> I would like to hear opinions on deprecating Trigger.Once and replacing
> it with Trigger.AvailableNow [1] in Structured Streaming.
>
> Rationale:
>
> The expected behavior of Trigger.Once is to read all data that has become
> available since the last trigger and process it. This holds true when the
> last run was terminated gracefully, but there are cases where streaming
> queries are not terminated gracefully. The last run may have written the
> offset (WAL) for a new batch before termination, in which case a new run of
> Trigger.Once only processes the data included in that latest unfinished
> batch and doesn't process new data.
>
> The behavior is not deterministic from the users' point of view, as end
> users wouldn't know whether the last run wrote the offset or not, unless
> they look into the query's checkpoint themselves.
>
> While Trigger.AvailableNow was introduced to solve the scalability issue of
> Trigger.Once, it also ensures that it tries to process all data available
> at the point in time it is triggered, which consistently matches the
> expected behavior of Trigger.Once.
>
> Proposed Plan:
>
> - Deprecate Trigger.Once in Apache Spark 3.3
> - Add guidance on migrating to Trigger.AvailableNow to the migration guide
> - Replace all usages of Trigger.Once with Trigger.AvailableNow, except the
>   test cases of Trigger.Once itself
>
> Please review the proposal and share your thoughts on it.
>
> Thanks!
> Jungtaek Lim
>
> 1. https://issues.apache.org/jira/browse/SPARK-36533
>
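
To make the failure mode in the rationale concrete, here is a toy sketch in plain Python. It is not Spark code and does not reflect Spark's actual checkpoint layout: `ToyCheckpoint`, `run_once`, `run_available_now`, and `crash_after_logging_offset` are all hypothetical names invented for illustration. It only models the two behaviors as described above: a single-batch trigger that replays a logged-but-uncommitted batch, and a trigger that keeps going until everything available at trigger time is processed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a query checkpoint (not Spark's real format)."""
    committed_end: int = 0             # data before this offset is fully processed
    pending_end: Optional[int] = None  # end offset logged (WAL) for an unfinished batch


def crash_after_logging_offset(source: List[str], cp: ToyCheckpoint) -> None:
    """Simulate a run that logs the next batch's end offset, then dies."""
    cp.pending_end = len(source)


def run_once(source: List[str], cp: ToyCheckpoint) -> List[str]:
    """Toy model of the Trigger.Once behavior described above: exactly one batch."""
    # A logged-but-uncommitted batch is replayed with its original end offset,
    # so data that arrived after the crash is left for some later run.
    end = cp.pending_end if cp.pending_end is not None else len(source)
    batch = source[cp.committed_end:end]
    cp.committed_end, cp.pending_end = end, None
    return batch


def run_available_now(source: List[str], cp: ToyCheckpoint) -> List[str]:
    """Toy model of Trigger.AvailableNow: process all data present at trigger time."""
    target = len(source)  # snapshot of what is available right now
    processed: List[str] = []
    if cp.pending_end is not None:
        # Finish the unfinished batch first.
        processed += source[cp.committed_end:cp.pending_end]
        cp.committed_end, cp.pending_end = cp.pending_end, None
    # Then continue until the snapshot taken at trigger time is reached.
    processed += source[cp.committed_end:target]
    cp.committed_end = target
    return processed


def crashed_scenario():
    """Two records arrive, the run crashes after logging its offset, two more arrive."""
    source = ["a", "b"]
    cp = ToyCheckpoint()
    crash_after_logging_offset(source, cp)
    source += ["c", "d"]
    return source, cp


if __name__ == "__main__":
    src, cp = crashed_scenario()
    print("once:", run_once(src, cp))
    src, cp = crashed_scenario()
    print("available now:", run_available_now(src, cp))
```

In this toy scenario the single-batch run handles only the records that were in the unfinished batch, while the "available now" run also picks up the records that arrived afterwards. Which of the two outcomes a user of the single-batch trigger gets depends on whether the previous run happened to log its offset before dying, which is the non-determinism the proposal is about.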