Final reminder. I'll leave this thread open for a couple of days to hear further voices, and will go forward if there are no outstanding comments.
On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> It sounds like none of the approaches perfectly solves the issue of
> backfill.
>
> 1. Trigger.Once: scale issue
> 2. Trigger.AvailableNow: watermark advancement issue (data getting dropped
> due to watermark) depending on the order of data
> 3. Manual batch: state is not built from processing backfill
>
> Handling huge data (backfill) with a single microbatch without advancing
> the watermark also requires thinking about backfill-specific situations:
> state can grow unexpectedly, since there is no way to purge it without
> watermark advancement. There doesn't seem to be a good approach that
> solves all of the issues smoothly. One easier way as of now is to use the
> RocksDB state store provider to tolerate the huge size of state while
> enforcing that the watermark does not advance, but the ideal approach
> still really depends on the data source and the volume of the data to
> backfill.
>
> Btw, don't worry if you get the feeling the deprecated API may get removed
> too soon! Removing the API would require another serious discussion, and
> the Spark community is generally not in favor of removing existing APIs.
>
> On Fri, Jul 8, 2022 at 11:21 PM Adam Binford <adam...@gmail.com> wrote:
>
>> Dang, I was hoping it was the second one. In our case the data is too
>> large to run the whole backfill for the aggregation in a single batch
>> (the shuffle is too big). We currently resort to manually batching (i.e.
>> not streaming) the backlog (anything older than the watermark) when we
>> need to reprocess, because we can't really know for sure that our
>> batches are processed in the correct event-time order when starting from
>> scratch.
>>
>> I'm not against deprecating Trigger.Once, just wanted to chime in that
>> someone was using it! I'm itching to upgrade and try out the new stuff.
>>
>> Adam
>>
>> On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Thanks for the input, Adam!
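As an aside for readers wanting to try the RocksDB state store provider mentioned above, a minimal sketch of enabling it (assuming Spark 3.2+, where the provider ships with Spark; the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: enables the RocksDB state store provider (Spark 3.2+) so
// large streaming state spills to local disk instead of the JVM heap.
val spark = SparkSession.builder()
  .appName("backfill-job") // placeholder
  .config(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```

This only changes where state is kept; it does not change watermark behavior, so the "enforce no watermark advancement" part of the workaround still has to be handled separately.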
>>> Replying inline.
>>>
>>> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford <adam...@gmail.com> wrote:
>>>
>>>> We use Trigger.Once a lot, usually for backfilling data for new
>>>> streams. I feel like I could see a continuing use case for "ignore
>>>> trigger limits for this batch" (ignoring the whole issue of re-running
>>>> the last failed batch vs. a new batch), but we haven't actually been
>>>> able to upgrade yet and try out Trigger.AvailableNow, so that could
>>>> end up replacing all our use cases.
>>>>
>>>> One question I did have is how it does (or is supposed to) handle
>>>> watermarking. Is the watermark determined for each batch
>>>> independently, like a normal stream, or is it kept constant for all
>>>> batches in a single AvailableNow run? For example, we have a stateful
>>>> job that we need to rerun occasionally, and it takes ~6 batches to
>>>> backfill all the data before catching up to live data. With
>>>> Trigger.Once we know we won't accidentally drop any data due to the
>>>> watermark when backfilling, because it's a single batch with no
>>>> watermark yet. Would the same hold true if we backfill with
>>>> Trigger.AvailableNow instead?
>>>
>>> The behavior is the former: each batch advances the watermark, and the
>>> new watermark is immediately reflected in the next batch.
>>>
>>> The number of batches Trigger.AvailableNow will execute depends on the
>>> data source and its source options. For example, if you use the Kafka
>>> data source with Trigger.AvailableNow without specifying any source
>>> option limiting the batch size, Trigger.AvailableNow will process all
>>> newly available data as a single microbatch. Strictly speaking it may
>>> still not be a single microbatch: it would first handle any batch
>>> already logged in the WAL, and it also runs a no-data batch after all
>>> the microbatches complete. But I guess these additional batches
>>> wouldn't hurt your case.
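To make the Kafka scenario above concrete, a hedged sketch (assumes Spark 3.3+, where the Kafka source supports Trigger.AvailableNow; broker, topic, and paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("available-now-demo").getOrCreate()

// No read-limit option (e.g. maxOffsetsPerTrigger) is set, so all newly
// available Kafka data is planned into a single microbatch, aside from any
// batch recovered from the WAL and the trailing no-data batch.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()

val query = input.writeStream
  .format("parquet")
  .option("path", "/data/out")               // placeholder
  .option("checkpointLocation", "/data/chk") // placeholder
  .trigger(Trigger.AvailableNow())
  .start()

query.awaitTermination() // returns once all currently available data is processed
```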
>>> If the data source doesn't allow processing all available data within
>>> a single microbatch (depending on the implementation of the default
>>> read limit), you could probably either 1) set the source option
>>> limiting the batch size to an unrealistically large value to enforce a
>>> single batch, or 2) set the watermark delay to an unrealistically
>>> large value. Both workarounds require you to use different source
>>> options / watermark configuration for backfill vs. the normal run; I
>>> agree it isn't a smooth solution.
>>>
>>> This proposal does not aim to remove Trigger.Once in the near future.
>>> Once we deprecate Trigger.Once, we would collect reports of use cases
>>> where Trigger.Once works better (like yours) over the span of several
>>> minor releases, and then we can really decide. (IMHO, handling
>>> backfill with Trigger.Once sounds like a workaround to me. Backfill
>>> may warrant its own design.)
>>>
>>>> Adam
>>>>
>>>> On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Bump to get a chance to expose the proposal to a wider audience.
>>>>>
>>>>> Given that there are not many active contributors/maintainers in the
>>>>> area of Structured Streaming, I'd treat the discussion as "lazy
>>>>> consensus" to avoid being stuck. I'll give a final reminder early
>>>>> next week, and move forward if there are no outstanding objections.
>>>>>
>>>>> On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi dev,
>>>>>>
>>>>>> I would like to hear voices about deprecating Trigger.Once and
>>>>>> promoting Trigger.AvailableNow as a replacement [1] in Structured
>>>>>> Streaming.
>>>>>> (It doesn't mean we remove Trigger.Once now or in the near future;
>>>>>> that would probably require another discussion at some point.)
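The two workarounds described above (an unrealistically large read limit, or an unrealistically large watermark delay) could look roughly like this for the Kafka source; all option values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("backfill-workarounds").getOrCreate()

// Workaround 1: raise the source's read limit so far that the whole
// backlog fits in one microbatch (maxOffsetsPerTrigger for Kafka).
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("maxOffsetsPerTrigger", Long.MaxValue.toString)
  .load()

// Workaround 2: make the watermark delay so large that no data is dropped
// as "late" during the backfill run. "timestamp" here is the Kafka message
// timestamp column; the delay value is deliberately absurd.
val backfillReady = input
  .withWatermark("timestamp", "3650 days")
```

Note the thread's caveat applies: a backfill run then needs a different configuration from the normal run, and with no watermark advancement the state may grow without being purged.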
>>>>>> Rationale:
>>>>>>
>>>>>> The expected behavior of Trigger.Once is to read all data that
>>>>>> became available after the last trigger and process it. This holds
>>>>>> true when the last run terminated gracefully, but there are cases
>>>>>> where streaming queries are not terminated gracefully. The last run
>>>>>> may have written the offset for a new batch before termination; a
>>>>>> new run of Trigger.Once then only processes the data bound to that
>>>>>> latest unfinished batch and doesn't process any newer data.
>>>>>>
>>>>>> The behavior is not deterministic from the users' point of view, as
>>>>>> end users can't know whether the last run wrote the offset or not
>>>>>> unless they look into the query's checkpoint themselves.
>>>>>>
>>>>>> While Trigger.AvailableNow came about to solve the scalability issue
>>>>>> of Trigger.Once, it also ensures that the query tries to process all
>>>>>> data available at the point in time it is triggered, which
>>>>>> consistently matches the expected behavior of Trigger.Once.
>>>>>>
>>>>>> Another issue with Trigger.Once is that it does not trigger a
>>>>>> no-data batch immediately. When the watermark is calculated in batch
>>>>>> N, it takes effect in batch N + 1. If the query is scheduled to run
>>>>>> once per day, you only see the output derived from the new watermark
>>>>>> in the next day's run. Thanks to its behavior, Trigger.AvailableNow
>>>>>> also handles the no-data batch before the query terminates.
>>>>>>
>>>>>> Please review and let us know if you have any feedback or concerns
>>>>>> about the proposal.
>>>>>>
>>>>>> Thanks!
>>>>>> Jungtaek Lim
>>>>>>
>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-36533
>>>>>
>>>>
>>>> --
>>>> Adam Binford
>>>
>>
>> --
>> Adam Binford
>
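For anyone evaluating the proposal discussed in this thread, the change in user code is essentially one line; a hedged sketch (the rate source, app name, and checkpoint paths are placeholders, and in practice you would pick one trigger, not both):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-comparison").getOrCreate()
val df = spark.readStream.format("rate").load() // placeholder source

// Deprecated-to-be: a single microbatch. After an ungraceful shutdown it
// may only finish the previously planned batch, and the no-data batch that
// applies a newly advanced watermark is deferred to the next run.
df.writeStream
  .format("console")
  .option("checkpointLocation", "/data/chk-once") // placeholder
  .trigger(Trigger.Once())
  .start()

// Proposed replacement: possibly several rate-limited microbatches covering
// everything available at start, followed by a no-data batch before the
// query terminates.
df.writeStream
  .format("console")
  .option("checkpointLocation", "/data/chk-availablenow") // placeholder
  .trigger(Trigger.AvailableNow())
  .start()
```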