On Thu, Mar 24, 2016 at 9:56 PM, Sachith Withana <[email protected]> wrote:
> Hi Inosh,
>
> We wouldn't have to do that IMO.
>
> We can persist the total aggregate value up to currentTimeWindow -
> WindowOffset, along with the previous time window's aggregation metadata
> as well (count and sum, in the average aggregation case). Yep. That will
> do.
>
> The previous total wouldn't be calculated again; it's the last two time
> windows (including the current one) that we need to recalculate and add
> to the previous total.
>
> It works almost the same way as the current incremental processing table,
> but keeps more metadata on the aggregation-related values.
>
> @Anjana WDYT?
>
> Thanks,
> Sachith
>
> On Fri, Mar 25, 2016 at 7:07 AM, Inosh Goonewardena <[email protected]>
> wrote:
>
>> Hi Sachith,
>>
>> *WindowOffset*:
>>>
>>> Events might arrive late that belong to a previously processed time
>>> window. To account for that, we have added an optional parameter that
>>> allows immediately previous time windows to be processed as well (it
>>> acts like a buffer).
>>> ex: If this is set to 1, apart from the to-be-processed data, data
>>> related to the previously processed time window will also be taken up
>>> for processing.
>>
>> This means that if the window offset is set, already-processed data will
>> be processed again. How do the aggregate functions work in this case?
>> Actually, what is the plan for supporting aggregate functions? Let's
>> take the average function as an example. Are we going to persist sum and
>> count values per time window and re-calculate the whole average based on
>> the values of all time windows? If so, I would guess we can update the
>> previously processed time windows' sum and count values.
>>
>> On Thu, Mar 24, 2016 at 3:50 AM, Sachith Withana <[email protected]>
>> wrote:
>>
>>> Hi Srinath,
>>>
>>> Please find the comments inline.
>>>
>>> On Thu, Mar 24, 2016 at 11:39 AM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> Hi Sachith, Anjana,
>>>>
>>>> +1 for the backend model.
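The sum/count metadata scheme Sachith describes at the top of the thread can be sketched in a few lines. This is an illustrative sketch only, not the DAS implementation; the function and variable names (`merge_window_metadata`, `persisted`, `recomputed`) are hypothetical:

```python
# Illustrative sketch (not the DAS implementation): keep (sum, count)
# metadata per time window so an average can be updated without re-reading
# every window. Late events only affect the windows that were recomputed.

def merge_window_metadata(persisted, recomputed):
    """Overwrite the recalculated windows' (sum, count) pairs and derive
    the overall average from the merged metadata.

    persisted:  {window_start: (sum, count)} for already-processed windows
    recomputed: {window_start: (sum, count)} for the last
                windowOffset + 1 windows, freshly recalculated
    """
    merged = dict(persisted)
    merged.update(recomputed)  # replace recomputed windows; don't double-count
    total = sum(s for s, _ in merged.values())
    count = sum(c for _, c in merged.values())
    avg = total / count if count else None
    return merged, avg
```

Only the recalculated windows' entries change; the previous totals are reused, which matches the "recalculate the last two time windows and add to the previous total" behaviour described above.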
>>>> Are we handling the case where the last run was done and 25 minutes
>>>> of data was processed? Basically, the next run has to re-run the last
>>>> hour and update the value.
>>>
>>> Yes. It will recalculate for that hour and will update the value.
>>>
>>>> When does the one-hour counting start? Is it from the moment the
>>>> server starts? That will be probabilistic when you restart. I think we
>>>> need to either start at a known place (midnight) or let the user
>>>> configure it.
>>>
>>> In the first run, all the available data is processed.
>>> After that, it takes the floor of the last processed event's timestamp
>>> (timestamp - timestamp % 3600), and that value is used as the start of
>>> the time windows.
>>>
>>>> I am a bit concerned about the syntax though. This only works for very
>>>> specific types of queries (those that include an aggregate and a group
>>>> by). What happens if a user does this with a different query? Can we
>>>> give a clear error message?
>>>
>>> Currently the error messages are very generic. We will have to work on
>>> improving those messages.
>>>
>>> Thanks,
>>> Sachith
>>>
>>>> --Srinath
>>>>
>>>> On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We are adding incremental processing capability to Spark in DAS.
>>>>> As the first stage, we added time slicing to Spark execution.
>>>>>
>>>>> Here's a quick introduction to that.
>>>>>
>>>>> *Execution*:
>>>>>
>>>>> In the first run of the script, it will process all the data in the
>>>>> given table and store the last processed event timestamp.
>>>>> From the next run onwards, it starts processing from that stored
>>>>> timestamp.
>>>>>
>>>>> Until the query that contains the data processing part completes, the
>>>>> last processed event timestamp is not overwritten with the new value.
>>>>> This is to ensure that the data processing for the query doesn't
>>>>> have to be done again if the whole query fails.
>>>>> This is ensured by adding a commit query after the main query.
>>>>> Refer to the Syntax section for an example.
>>>>>
>>>>> *Syntax*:
>>>>>
>>>>> In the Spark script, incremental processing support has to be
>>>>> specified per table; this happens in the CREATE TEMPORARY TABLE line.
>>>>>
>>>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics OPTIONS (tableName
>>>>> "test", *incrementalProcessing "T1,3600"*);
>>>>>
>>>>> INSERT INTO T2 SELECT username, age FROM T1 GROUP BY age;
>>>>>
>>>>> INC_TABLE_COMMIT T1;
>>>>>
>>>>> The last line is what ensures the processing took place successfully;
>>>>> it replaces the lastProcessed timestamp with the new one.
>>>>>
>>>>> *TimeWindow*:
>>>>>
>>>>> To do incremental processing, the user has to provide the time window
>>>>> within which the data is processed.
>>>>> In the above example, the data would be summarized in *1 hour* time
>>>>> windows.
>>>>>
>>>>> *WindowOffset*:
>>>>>
>>>>> Events that belong to a previously processed time window might arrive
>>>>> late. To account for that, we have added an optional parameter that
>>>>> allows immediately previous time windows to be processed as well (it
>>>>> acts like a buffer).
>>>>> ex: If this is set to 1, apart from the to-be-processed data, data
>>>>> related to the previously processed time window will also be taken up
>>>>> for processing.
>>>>>
>>>>> *Limitations*:
>>>>>
>>>>> Currently, multiple time windows cannot be specified per temporary
>>>>> table in the same script.
>>>>> It would have to be done using different temporary tables.
>>>>>
>>>>> *Future Improvements*:
>>>>> - Add aggregation function support for incremental processing
>>>>>
>>>>> Thanks,
>>>>> Sachith
>>>>> --
>>>>> Sachith Withana
>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>> E-mail: sachith AT wso2.com
>>>>> M: +94715518127
>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>
>>>> --
>>>> ============================
>>>> Srinath Perera, Ph.D.
>>>> http://people.apache.org/~hemapani/
>>>> http://srinathsview.blogspot.com/
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>> --
>> Thanks & Regards,
>>
>> Inosh Goonewardena
>> Associate Technical Lead- WSO2 Inc.
>> Mobile: +94779966317
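Putting together the TimeWindow and WindowOffset ideas discussed in the thread, the set of windows a run would (re)process could be sketched as follows. This is a simplified illustration under assumed semantics, not DAS code; `windows_to_process` and its parameters are hypothetical names:

```python
# Simplified sketch (assumed semantics, not DAS code): given the last
# processed event timestamp and a window offset, list the window start
# times the next run should (re)process. With window_offset = 1, the
# window just before the last processed one is re-read as well, so
# late-arriving events are picked up.

def windows_to_process(last_processed_ts, now_ts,
                       window_size=3600, window_offset=1):
    """Return window start times from (last window - offset) up to the
    window containing now_ts, inclusive."""
    first = (last_processed_ts - last_processed_ts % window_size
             - window_offset * window_size)
    last = now_ts - now_ts % window_size
    return list(range(max(first, 0), last + window_size, window_size))
```

Reprocessing one extra already-processed window is what makes the persisted per-window metadata (count and sum) necessary: the recomputed window's values replace the stale ones instead of being added twice.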
