This may be possible with a custom WindowFn. Where is the configuration of what aggregations to do coming from?
On Wed, Sep 21, 2016 at 11:27 PM, Chawla,Sumit <[email protected]> wrote: > Attaching the Image. > > > > > Regards > Sumit Chawla > > > On Wed, Sep 21, 2016 at 11:24 PM, Chawla,Sumit <[email protected]> > wrote: > >> Hi All >> >> I am trying to code a solution for following scenarios. >> >> 1. I have a stream of Tuples with multiple numeric fields (e.g. A, B, C, >> D, E ... etc ) >> 2. I want the ability to do different Windowing and Aggregation on each >> field or a group of fields in the Tuple. e.g. Sum A over a Period of 2 >> minutes, Avg B over a period of 3 minutes, Sum of C grouped by D over a >> period of 15 minutes >> 3. *These window requirements can be added by user at runtime*. My >> pipeline should be able to compute a new aggregation at runtime. >> 4. Plan to support only simple aggregation windows like SUM, AVG, MAX, >> MIN, COUNT etc. >> >> >> As i understand in BEAM pipelines ( with Flink Runner), the DAG of >> computations cannot be altered once the pipeline is deployed. I am trying >> to see how can i support above use case. I would love to hear your >> feedback on this, and suggestions on doing it in a completely different >> way. >> >> *My Design:* >> >> 1. Create 1 minute buckets per Field or Group of Fields and compute >> basic aggregations for bucket. e.g. Buckets are highlighted in Yellow >> here. For each field i calculate [SUM, COUNT, MAX, MIN] in the bucket. ( >> Bucket size of 1 minute is defined for simplicity, and would limit the >> minimum window size to 1 minute) >> >> 2. Downstream process these buckets, and compute the user defined >> aggregations. Following diagram depicts Tumbling Window computations. The >> Aggregation functions in GREEN are just NATIVE functions consuming >> different buckets, and doing aggregations on top of these buckets. >> >> >> >> >> >> >> >> *P.S.* >> >> * Some of the design choices that i have decided not to go for are:* >> >> 1. Multiple Pipelines for doing computation. One master pipeline does >> grouping, and sends to a different topic based on user configured window >> size. (e.g. topic_window_by_5_min, topic_window_by_15_min), and have a >> different pipeline consume each topic. >> >> 2. Single pipeline doing all the business with predefined Windows >> defined for Downstream processing. e.g. 5, 15, 30, 60 minute windows will >> be defined which will consume from different Side Inputs. User is only >> allowed only to select these Window sizes. Upstream Group By operator will >> route to the data to different Window Function based on user configuration. >> >> >> >> Regards >> Sumit Chawla >> >> >
