Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

Adam Binford Fri, 08 Jul 2022 04:48:52 -0700

We use Trigger.Once a lot, usually for backfilling data for new streams. I
feel like I could see a continuing use case for "ignore trigger limits for
this batch" (ignoring the whole issue with re-running the last failed batch
vs a new batch), but we haven't actually been able to upgrade yet and try
out Trigger.AvailableNow, so that could end up replacing all our use cases.


One question I did have is how it does (or is supposed to) handle
watermarking. Is the watermark determined for each batch independently like
a normal stream, or is it kept constant for all batches in a single
AvailableNow run? For example, we have a stateful job that we need to rerun
occasionally, and it takes ~6 batches to backfill all the data before
catching up to live data. With a Trigger.Once we know we won't accidentally
drop any data due to the watermark when backfilling, because it's a single
batch with no watermark yet. Would the same hold true if we backfill with
Trigger.AvailableNow instead?

Adam

On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <[email protected]>
wrote:

> Bump to get a chance to expose the proposal to wider audiences.
>
> Given that there are not many active contributors/maintainers in area
> Structured Streaming, I'd consider the discussion as "lazy consensus" to
> avoid being stuck. I'll give a final reminder early next week, and move
> forward if there are no outstanding objections.
>
> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <[email protected]>
> wrote:
>
>> Hi dev,
>>
>> I would like to hear voices about deprecating Trigger.Once, and promoting
>> Trigger.AvailableNow as a replacement [1] in Structured Streaming.
>> (It doesn't mean we remove Trigger.Once now or near future. It probably
>> requires another discussion at some time.)
>>
>> Rationalization:
>>
>> The expected behavior of Trigger.Once is like reading all available data
>> after the last trigger and processing them. This holds true when the last
>> run was gracefully terminated, but there are cases streaming queries to not
>> be terminated gracefully. There is a possibility the last run may write the
>> offset for the new batch before termination, then a new run of Trigger.Once
>> only processes the data which was built in the latest unfinished batch and
>> doesn't process new data.
>>
>> The behavior is not deterministic from the users' point of view, as end
>> users wouldn't know whether the last run wrote the offset or not, unless
>> they look into the query's checkpoint by themselves.
>>
>> While Trigger.AvailableNow came to solve the scalability issue on
>> Trigger.Once, it also ensures that it tries to process all available data
>> at the point of time it is triggered, which consistently works as expected
>> behavior of Trigger.Once.
>>
>> Another issue on Trigger.Once is that it does not trigger a no-data batch
>> immediately. When the watermark is calculated in batch N, it takes effect
>> in batch N + 1. If the query is scheduled to be run per day, you can see
>> the output from the new watermark in the query run the next day. Thanks to
>> the behavior of Trigger.AvailableNow, it handles no-data batch as well
>> before termination of the query.
>>
>> Please review and let us know if you have any feedback or concerns on the
>> proposal.
>>
>> Thanks!
>> Jungtaek Lim
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>
>

-- 
Adam Binford

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

Reply via email to