Final reminder. I'll leave this thread for a couple of days to see further
voices, and go forward if there is no outstanding comment.

On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> It sounds like none of the approaches perfectly solve the issue of
> backfill.
>
> 1. Trigger.Once: scale issue
> 2. Trigger.AvailbleNow: watermark advancement issue (data getting dropped
> due to watermark) depending on the order of data
> 3. Manual batch: state is not built from processing backfill
>
> Handling a huge data (backfill) with a single microbatch without advancing
> the watermark also requires thinking of "backfill-specific" situations -
> state can grow unexpectedly since there is no way to purge without
> watermark advancement. There seems to be not really a good approach to
> solve all of the issues smoothly. One easier way as of now is to use
> RocksDB state store provider to tolerate the huge size of state while we
> enforce to not advance watermark, but the ideal approach still really
> depends on the data source and the volume of the data to backfill.
>
> Btw, don't worry if you get a feeling the deprecated API may get removed
> too soon! Removing the API would require another serious discussion and
> Spark community is generally not in favor of removing existing API.
>
> 2022년 7월 8일 (금) 오후 11:21, Adam Binford <adam...@gmail.com>님이 작성:
>
> Dang I was hoping it was the second one. In our case the data is too large
>> to run the whole backfill for the aggregation in a single batch (the
>> shuffle is too big). We currently resort to manually batching (i.e. not
>> streaming) the backlog (anything older than the watermark) when we need to
>> reprocess, because we can't really know for sure our batches are processed
>> in the correct event time order when starting from scratch.
>>
>> I'm not against deprecating Trigger.Once, just wanted to chime in that
>> someone was using it! I'm itching to upgrade and try out the new stuff.
>>
>> Adam
>>
>> On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Thanks for the input, Adam! Replying inline.
>>>
>>> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford <adam...@gmail.com> wrote:
>>>
>>>> We use Trigger.Once a lot, usually for backfilling data for new
>>>> streams. I feel like I could see a continuing use case for "ignore trigger
>>>> limits for this batch" (ignoring the whole issue with re-running the last
>>>> failed batch vs a new batch), but we haven't actually been able to upgrade
>>>> yet and try out Trigger.AvailableNow, so that could end up replacing all
>>>> our use cases.
>>>>
>>>> One question I did have is how it does (or is supposed to) handle
>>>> watermarking. Is the watermark determined for each batch independently like
>>>> a normal stream, or is it kept constant for all batches in a single
>>>> AvailableNow run? For example, we have a stateful job that we need to rerun
>>>> occasionally, and it takes ~6 batches to backfill all the data before
>>>> catching up to live data. With a Trigger.Once we know we won't accidentally
>>>> drop any data due to the watermark when backfilling, because it's a single
>>>> batch with no watermark yet. Would the same hold true if we backfill with
>>>> Trigger.AvailableNow instead?
>>>>
>>>
>>> The behavior is the former one. Each batch advances the watermark and
>>> it's immediately reflected on the next batch.
>>>
>>> The number of batches Trigger.AvailableNow will execute depends on the
>>> data source and the source option. For example, if you use Kafka data
>>> source and use Trigger.AvailableNow without specifying any source option on
>>> limiting the size, Trigger.AvailableNow will process all newly available
>>> data as a single microbatch. It may not be still a single microbatch - it
>>> would also handle the batch already logged in WAL first if any, as well as
>>> handle no-data batch after the run of all microbatches. But I guess these
>>> additional batches wouldn't hurt your case.
>>>
>>> If the data source doesn't allow processing all available data within a
>>> single microbatch (depending on the implementation of default read limit),
>>> you could probably either 1) set source options regarding to limit size as
>>> an unrealistic one to enforce a single batch or 2) set the delay of
>>> watermark as an unrealistic one. Both of the workarounds require you to use
>>> different source options/watermark configuration for backfill vs normal run
>>> - I agree it wouldn’t be a smooth one.
>>>
>>> This proposal does not aim to remove Trigger.Once in near future. As
>>> long as we deprecate Trigger.Once, we would get some reports for use cases
>>> Trigger.Once may work better (like your case) for the time period across
>>> several minor releases, and then we can really decide. (IMHO, handling
>>> backfill with Trigger.Once sounds to me as a workaround. Backfill may
>>> warrant its own design to deal with.)
>>>
>>>
>>>>
>>>> Adam
>>>>
>>>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Bump to get a chance to expose the proposal to wider audiences.
>>>>>
>>>>> Given that there are not many active contributors/maintainers in area
>>>>> Structured Streaming, I'd consider the discussion as "lazy consensus" to
>>>>> avoid being stuck. I'll give a final reminder early next week, and move
>>>>> forward if there are no outstanding objections.
>>>>>
>>>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi dev,
>>>>>>
>>>>>> I would like to hear voices about deprecating Trigger.Once, and
>>>>>> promoting Trigger.AvailableNow as a replacement [1] in Structured 
>>>>>> Streaming.
>>>>>> (It doesn't mean we remove Trigger.Once now or near future. It
>>>>>> probably requires another discussion at some time.)
>>>>>>
>>>>>> Rationalization:
>>>>>>
>>>>>> The expected behavior of Trigger.Once is like reading all available
>>>>>> data after the last trigger and processing them. This holds true when the
>>>>>> last run was gracefully terminated, but there are cases streaming queries
>>>>>> to not be terminated gracefully. There is a possibility the last run may
>>>>>> write the offset for the new batch before termination, then a new run of
>>>>>> Trigger.Once only processes the data which was built in the latest
>>>>>> unfinished batch and doesn't process new data.
>>>>>>
>>>>>> The behavior is not deterministic from the users' point of view, as
>>>>>> end users wouldn't know whether the last run wrote the offset or not,
>>>>>> unless they look into the query's checkpoint by themselves.
>>>>>>
>>>>>> While Trigger.AvailableNow came to solve the scalability issue on
>>>>>> Trigger.Once, it also ensures that it tries to process all available data
>>>>>> at the point of time it is triggered, which consistently works as 
>>>>>> expected
>>>>>> behavior of Trigger.Once.
>>>>>>
>>>>>> Another issue on Trigger.Once is that it does not trigger a no-data
>>>>>> batch immediately. When the watermark is calculated in batch N, it takes
>>>>>> effect in batch N + 1. If the query is scheduled to be run per day, you 
>>>>>> can
>>>>>> see the output from the new watermark in the query run the next day. 
>>>>>> Thanks
>>>>>> to the behavior of Trigger.AvailableNow, it handles no-data batch as well
>>>>>> before termination of the query.
>>>>>>
>>>>>> Please review and let us know if you have any feedback or concerns on
>>>>>> the proposal.
>>>>>>
>>>>>> Thanks!
>>>>>> Jungtaek Lim
>>>>>>
>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Adam Binford
>>>>
>>>
>>
>> --
>> Adam Binford
>>
>

Reply via email to