We use Trigger.Once a lot, usually for backfilling data for new streams. I feel like I could see a continuing use case for "ignore trigger limits for this batch" (ignoring the whole issue with re-running the last failed batch vs a new batch), but we haven't actually been able to upgrade yet and try out Trigger.AvailableNow, so that could end up replacing all our use cases.
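For reference, here is a minimal PySpark sketch of switching a backfill job between the two triggers. The helper names (`trigger_kwargs`, `start_backfill`), the parquet sink, and the paths are hypothetical; `availableNow=True` assumes Spark 3.3 or later. As I understand it, `once=True` reads everything in a single batch regardless of source limits such as `maxFilesPerTrigger`, while `availableNow=True` honors them and may split the run into several batches.

```python
def trigger_kwargs(use_available_now: bool) -> dict:
    """Keyword arguments for DataStreamWriter.trigger()."""
    # availableNow=True needs Spark 3.3+; once=True is the older
    # single-batch trigger.
    return {"availableNow": True} if use_available_now else {"once": True}


def start_backfill(stream_df, sink_path, checkpoint_path,
                   use_available_now=True):
    """Start a run-to-completion query on a streaming DataFrame.

    stream_df is assumed to come from spark.readStream, optionally with a
    rate limit such as .option("maxFilesPerTrigger", 100) on the source.
    """
    return (
        stream_df.writeStream
        .format("parquet")  # hypothetical sink, for illustration only
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(use_available_now))
        .start(sink_path)
    )
```

Calling `start_backfill(df, "/tmp/out", "/tmp/ckpt").awaitTermination()` would block until the run has processed the data available at start and stopped on its own.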
One question I did have is how it does (or is supposed to) handle watermarking. Is the watermark determined for each batch independently like a normal stream, or is it kept constant for all batches in a single AvailableNow run? For example, we have a stateful job that we need to rerun occasionally, and it takes ~6 batches to backfill all the data before catching up to live data. With a Trigger.Once we know we won't accidentally drop any data due to the watermark when backfilling, because it's a single batch with no watermark yet. Would the same hold true if we backfill with Trigger.AvailableNow instead?

Adam

On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Bump to get a chance to expose the proposal to wider audiences.
>
> Given that there are not many active contributors/maintainers in area
> Structured Streaming, I'd consider the discussion as "lazy consensus" to
> avoid being stuck. I'll give a final reminder early next week, and move
> forward if there are no outstanding objections.
>
> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Hi dev,
>>
>> I would like to hear voices about deprecating Trigger.Once, and promoting
>> Trigger.AvailableNow as a replacement [1] in Structured Streaming.
>> (It doesn't mean we remove Trigger.Once now or near future. It probably
>> requires another discussion at some time.)
>>
>> Rationalization:
>>
>> The expected behavior of Trigger.Once is like reading all available data
>> after the last trigger and processing them. This holds true when the last
>> run was gracefully terminated, but there are cases streaming queries to not
>> be terminated gracefully. There is a possibility the last run may write the
>> offset for the new batch before termination, then a new run of Trigger.Once
>> only processes the data which was built in the latest unfinished batch and
>> doesn't process new data.
>>
>> The behavior is not deterministic from the users' point of view, as end
>> users wouldn't know whether the last run wrote the offset or not, unless
>> they look into the query's checkpoint by themselves.
>>
>> While Trigger.AvailableNow came to solve the scalability issue on
>> Trigger.Once, it also ensures that it tries to process all available data
>> at the point of time it is triggered, which consistently works as expected
>> behavior of Trigger.Once.
>>
>> Another issue on Trigger.Once is that it does not trigger a no-data batch
>> immediately. When the watermark is calculated in batch N, it takes effect
>> in batch N + 1. If the query is scheduled to be run per day, you can see
>> the output from the new watermark in the query run the next day. Thanks to
>> the behavior of Trigger.AvailableNow, it handles no-data batch as well
>> before termination of the query.
>>
>> Please review and let us know if you have any feedback or concerns on the
>> proposal.
>>
>> Thanks!
>> Jungtaek Lim
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>
>

--
Adam Binford
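
A toy model of the watermark timing described in the quoted proposal ("calculated in batch N, takes effect in batch N + 1") may help make the backfill concern concrete. This is plain Python, not Spark code, and `run_batches` is a hypothetical helper. It only shows what would happen if the watermark advances between the batches of a single run; whether Trigger.AvailableNow actually does that is the open question above.

```python
def run_batches(batches, delay):
    """Toy model of event-time watermarking across micro-batches.

    batches: list of lists of event times (ints).
    delay:   allowed lateness, in the same unit as the event times.
    Returns (kept, dropped) lists of event times.
    """
    watermark = None  # no watermark until a batch has been processed
    max_seen = None
    kept, dropped = [], []
    for batch in batches:
        for t in batch:
            # Rows older than the watermark in effect for this batch are late.
            if watermark is not None and t < watermark:
                dropped.append(t)
            else:
                kept.append(t)
        # The watermark computed from batch N only applies from batch N + 1.
        if batch:
            newest = max(batch)
            max_seen = newest if max_seen is None else max(max_seen, newest)
            watermark = max_seen - delay
    return kept, dropped


events = [1, 50, 3, 60, 2]  # out-of-order event times
single = run_batches([events], delay=10)
split = run_batches([events[:2], events[2:]], delay=10)
```

In this model the single batch keeps every row, while the split run drops the out-of-order rows 3 and 2 in its second batch, because the watermark has already advanced to 40 by then.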