No matter how the problem is structured, computing 30-day aggregations for every 10-minute window requires storing at least 30 days / 10 minutes = 4320 sub-aggregations. In Beam, the elements themselves are not stored in every window; only the intermediate aggregates are.

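To make the arithmetic concrete, here is a rough sketch in plain Python (not Beam code, and not a description of how Beam implements this internally). The names `windows_per_element` and `SlidingSum` are made up for illustration. It counts how many sliding windows contain a single timestamp, and shows how one partial aggregate per 10-minute slice is enough to produce the 30-day result without keeping the raw elements around.

```python
from collections import deque
from datetime import timedelta


def windows_per_element(size: timedelta, period: timedelta) -> int:
    """Number of sliding windows (of `size`, starting every `period`)
    that contain any single timestamp. Uses ceiling division."""
    return -(-size // period)


class SlidingSum:
    """Hypothetical helper: keeps one partial sum per period-sized slice.

    Only `num_slices` numbers are stored, never the raw elements.
    """

    def __init__(self, num_slices: int) -> None:
        # The deque drops the oldest slice automatically once it is full.
        self.slices = deque([0] * num_slices, maxlen=num_slices)

    def add(self, value: int) -> None:
        # Fold the element into the current (newest) slice.
        self.slices[-1] += value

    def advance(self) -> None:
        # Start a new slice; the oldest one falls out of the window.
        self.slices.append(0)

    def total(self) -> int:
        # Combine all partial sums to get the full-window aggregate.
        return sum(self.slices)


if __name__ == "__main__":
    n = windows_per_element(timedelta(days=30), timedelta(minutes=10))
    print(n)  # 4320

    # The alternative from the original question: 60 days every 30 days.
    print(windows_per_element(timedelta(days=60), timedelta(days=30)))  # 2
```

The sketch uses a sum because sums combine trivially; any associative, commutative combiner would work the same way. Whether 4320 partial aggregates per key is actually a problem depends on the key count and the runner, which is why trying it first is the sensible step.
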
I second Luke's suggestion to try it out and see if this is indeed a prohibitive bottleneck.

On Tue, Oct 29, 2019 at 1:29 PM Luke Cwik <[email protected]> wrote:
>
> You should first try the obvious answer of using a sliding window of 30 days
> every 10 minutes before you try the 60 days every 30 days.
> Beam has some optimizations which will assign a value to multiple windows and
> only process that value once even if it's in many windows. If that doesn't
> perform well, then come back to dev@ and look to optimize.
>
> On Tue, Oct 29, 2019 at 1:22 PM Aaron Dixon <[email protected]> wrote:
>>
>> Hi I am new to Beam.
>>
>> I would like to accumulate data over 30 day period and perform a running
>> aggregation over this data, say every 10 minutes.
>>
>> I could use a sliding window of 30 days every 10 minutes (triggering at end
>> of window) but this seems grossly inefficient (both in terms of # of windows
>> at play and # of events duplicated across these windows).
>>
>> A more efficient strategy seems to be to use a sliding window of 60 days
>> every 30 days -- triggering every 10 minutes -- so that I'm guaranteed to
>> have 30 days worth of data aggregated/combined in at least one of the 2
>> at-play sliding windows.
>>
>> The last piece of this puzzle however would be to do a final global
>> aggregation over only the keys from the latest trigger of the earlier
>> sliding window.
>>
>> But Beam does not seem to offer a way to orchestrate this. Even though this
>> seems like it would be a pretty common or fundamental ask.
>>
>> One thought I had was to re-window in a way that would isolate keys
>> triggered at the same time, in the same window but I don't see any contracts
>> from Beam that would allow an approach like that.
>>
>> What am I missing?
