This may be possible with a custom WindowFn. Where is the configuration of
what aggregations to do coming from?

On Wed, Sep 21, 2016 at 11:27 PM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Attaching the Image.
>
>
> ​
>
> Regards
> Sumit Chawla
>
>
> On Wed, Sep 21, 2016 at 11:24 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Hi All
>>
>> I am trying to code a solution for following scenarios.
>>
>> 1.  I have a stream of Tuples with multiple numeric fields (e.g. A, B, C,
>> D, E ... etc )
>> 2.  I want the ability to do different Windowing and Aggregation on each
>> field or a group of fields in the Tuple.  e.g. Sum A over a Period of 2
>> minutes, Avg B over a period of 3 minutes,  Sum of C grouped by D over a
>> period of 15 minutes
>> 3.  *These window requirements can be added by user at runtime*.  My
>> pipeline should be able to compute a new aggregation at runtime.
>> 4.  Plan to support only simple aggregation windows like SUM, AVG, MAX,
>> MIN, COUNT etc.
>>
>>
>> As i understand in BEAM pipelines ( with Flink Runner), the DAG of
>> computations cannot be altered once the pipeline is deployed.  I am trying
>> to see how can i support above use case.  I would love to hear your
>> feedback on this, and suggestions on doing it in a completely different
>> way.
>>
>> *My Design:*
>>
>> 1.   Create 1 minute buckets per Field or Group of Fields and compute
>> basic aggregations for bucket.  e.g.  Buckets are highlighted in Yellow
>> here.  For each field i calculate [SUM, COUNT, MAX, MIN] in the bucket.  (
>> Bucket size of 1 minute is defined for simplicity, and would limit the
>> minimum window size to 1 minute)
>>
>> 2.  Downstream process these buckets, and compute the user defined
>> aggregations.  Following diagram depicts Tumbling Window computations.  The
>> Aggregation functions in GREEN are just NATIVE functions consuming
>> different buckets, and doing aggregations on top of these buckets.
>>
>>
>>
>>
>> ​
>> ​
>>
>> *P.S.*
>>
>> * Some of the design choices that i have decided not to go for are:*
>>
>> 1.  Multiple Pipelines for doing computation.  One master pipeline does
>> grouping, and sends to a different topic based on user configured window
>> size. (e.g. topic_window_by_5_min, topic_window_by_15_min), and have a
>> different pipeline consume each topic.
>>
>> 2.  Single pipeline doing all the business with predefined Windows
>> defined for Downstream processing. e.g. 5, 15, 30, 60 minute windows will
>> be defined which will consume from different Side Inputs.  User is only
>> allowed only to select these Window sizes.  Upstream Group By operator will
>> route to the data to different Window Function based on user configuration.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>

Reply via email to