I think this is OK for this release. However, we need to handle this cleanly and document clearly how it works for the next release. I think this is more of a problem for CEP. (Adding Suho.)
1. Can we add a Jira so we can handle this in a future release?
2. Can we update the documents to describe the current behaviour?

--Srinath

On Wed, Jun 15, 2016 at 9:03 PM, Inosh Goonewardena <[email protected]> wrote:
>
> On Wed, Jun 15, 2016 at 4:20 PM, Inosh Goonewardena <[email protected]> wrote:
>
>> Hi Srinath,
>>
>>> A DAS deployment might end up collecting data from multiple time
>>> zones. Is that supported? How is it handled? Do end users have to
>>> issue queries in UTC?
>>
>> We receive epoch time in the timestamp fields of incoming events. Based
>> on that timestamp we extract time attributes such as year, month, day,
>> hour, etc., and do per-<time unit> aggregations (such as per
>> minute/hour/day counts). At the moment we extract those time attributes
>> based on the local time zone the DAS server is running in. We started a
>> discussion [1] on the possibility of using UTC time, but decided
>> against it due to the discrepancies that can occur in window
>> boundaries.
>>
>> For example, if the DAS server is running in the IST time zone, some
>> example time window boundaries are as follows:
>> - day window [1461090600000 - 1461177000000] (2016/04/20 12:00:00 AM
>>   IST - 2016/04/21 12:00:00 AM IST)
>> - hour window [1461108600000 - 1461112200000] (2016/04/20 05:00:00 AM
>>   IST - 2016/04/20 06:00:00 AM IST)
>>
>> So with the above approach, the time zone of the data publisher we
>> collect data from is not considered, and windowing is done based on the
>> local time zone.
>>
>> Actually, if the data publisher publishes data from a different time
>> zone and the user wants to see window-based aggregates (in charts) in
>> that time zone, there will still be an issue, because the DAS window
>> boundaries differ from what the user expects.
>
> However, this can be avoided by doing windowing based on the time zone
> the data is collected from.
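The local-time-zone windowing described above can be illustrated with a small java.util.Calendar sketch. The timestamps are the ones from the example above; the class and method names are only for illustration, not the actual DAS code:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class WindowBoundary {

    // Floor an epoch-millis timestamp to the start of its day in the given zone.
    static long dayWindowStart(long ts, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTimeInMillis(ts);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        long ts = 1461111538669L; // 2016/04/20 05:48:58 AM IST

        // Day window start when the server runs in IST: 2016/04/20 12:00:00 AM IST
        System.out.println(dayWindowStart(ts, TimeZone.getTimeZone("Asia/Kolkata"))); // 1461090600000

        // The same event floored in UTC lands 5.5 hours later
        System.out.println(dayWindowStart(ts, TimeZone.getTimeZone("UTC")));          // 1461110400000
    }
}
```

The 5.5-hour gap between the two results is exactly the window-boundary discrepancy the thread is discussing.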
> Basically, what is explained above is the default behavior when the
> inbuilt temporal functions are used (e.g. the time:extract() Siddhi
> extension [1] and DateTimeUDF [2] in Spark). But it is possible to
> extend those extensions and do windowing based on the time zone the
> user is interested in. However, in order to support incremental data
> processing in such cases, we will need to take the time zone as an
> input parameter in the incremental query and set the incremental
> processing window start time based on that.
>
> [1] https://github.com/wso2/siddhi/blob/master/modules/siddhi-extensions/time/src/main/java/org/wso2/siddhi/extension/time/ExtractAttributesFunctionExtension.java
> [2] https://github.com/wso2/shared-analytics/blob/master/components/spark-udf/org.wso2.carbon.analytics.shared.spark.common.udf/src/main/java/org/wso2/carbon/analytics/shared/common/udf/DateTimeUDF.java
>
>> [1] [Analytics] Using UTC across all temporal analytics functions
>>
>>> Thoughts please.
>>>
>>> --Srinath
>>
>> On Wed, Jun 15, 2016 at 2:11 AM, Mohanadarshan Vivekanandalingam <[email protected]> wrote:
>>
>>> Hi Inosh,
>>>
>>> Thanks for the fix. We have incorporated this into analytics-is.
>>>
>>> Regards,
>>> Mohan
>>>
>>> On Tue, Jun 14, 2016 at 7:33 PM, Inosh Goonewardena <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We made a modification to the incremental data processing syntax and
>>>> implementation in order to avoid inconsistencies that occur with
>>>> time windows.
>>>>
>>>> For example, with the previous approach, if the timeWindow was
>>>> specified as 3600 (or 86400), the incremental read time was set to
>>>> the hour-start (or day-start) time in UTC. This is due to the fact
>>>> that the incremental start time is calculated by performing a modulo
>>>> operation on a timestamp. In most scenarios, analytics is
>>>> implemented on a per-<time unit> basis, i.e., we maintain summary
>>>> tables per minute, per hour, per day, per month.
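The modulo-based floor of the previous approach can be sketched as follows. Because epoch millis count from 1970-01-01 00:00 UTC, this floor always lands on a UTC boundary; the class name is illustrative and the timestamps are the ones used in the thread:

```java
public class ModuloFloor {

    // Previous approach: floor the timestamp with a modulo on the window
    // length in millis. Epoch millis are measured from a UTC origin, so
    // the result is always a UTC hour/day boundary.
    static long floorTo(long ts, long windowMillis) {
        return ts - ts % windowMillis;
    }

    public static void main(String[] args) {
        long ts = 1461111538669L; // 2016/04/20 00:18:58 AM UTC (05:48:58 AM IST)

        long dayMillis = 86400000L;
        System.out.println(floorTo(ts, dayMillis)); // 1461110400000 = 2016/04/20 12:00 AM UTC

        // The IST summary tables use 1461090600000 (2016/04/20 12:00 AM IST)
        // as the day start, 5.5 hours earlier, so the two windowings disagree.
    }
}
```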
>>>> In these summary tables, we do windowing based on the local time
>>>> zone of the server DAS runs on [1]. Since the incremental time
>>>> window started at a UTC hour or day boundary, inconsistencies
>>>> occurred in the aggregate values of time windows.
>>>>
>>>> With the newly introduced changes, the incremental window start time
>>>> is set based on the local time zone to avoid the aforementioned
>>>> inconsistencies. The syntax is as follows:
>>>>
>>>> CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options (tableName "test",
>>>>     incrementalProcessing "T1, WindowUnit, WindowOffset");
>>>>
>>>> The WindowUnit concept is similar to the previous TimeWindow concept,
>>>> but here the window's time unit needs to be provided instead of a
>>>> time in millis. The supported units are SECOND, MINUTE, HOUR, DAY,
>>>> MONTH and YEAR.
>>>>
>>>> For example, let's say the server runs in IST time and the last
>>>> processed event time is 1461111538669 (2016/04/20 05:48:58 AM IST).
>>>> If WindowUnit is set to DAY and WindowOffset is set to 0 in the
>>>> incremental table, the next script execution will read data starting
>>>> from 1461090600000 (2016/04/20 12:00:00 AM IST).
>>>>
>>>> [1] [Analytics] Using UTC across all temporal analytics functions
>>>>
>>>> On Fri, Mar 25, 2016 at 10:59 AM, Gihan Anuruddha <[email protected]> wrote:
>>>>
>>>>> Actually, currently in ESB-analytics we are also using a similar
>>>>> mechanism. There we keep five tables, one for each time unit
>>>>> (second, minute, hour, day, month), and we have a windowOffset
>>>>> corresponding to each time unit, e.g. minute - 60, hour - 3600,
>>>>> etc. We only store min, max, sum and count values. So at a given
>>>>> time, if we need the average, we divide the sum by the count.
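The min/max/sum/count scheme described above composes cleanly: each summary row keeps partial values from which averages can be derived, and re-processing a window simply replaces that window's partials. A minimal sketch under that assumption (class and method names are illustrative, not the actual ESB-analytics code):

```java
import java.util.HashMap;
import java.util.Map;

public class SummaryStore {

    // Partial aggregate state kept per time window.
    static final class Partial {
        long count;
        double sum;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void add(double v) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    // windowStart (epoch millis) -> partial state. Re-running a window
    // overwrites its entry, so late events are absorbed correctly.
    final Map<Long, Partial> windows = new HashMap<>();

    void observe(long windowStart, double value) {
        windows.computeIfAbsent(windowStart, k -> new Partial()).add(value);
    }

    // Average across all windows, derived from sums and counts only.
    double overallAverage() {
        long c = 0;
        double s = 0;
        for (Partial p : windows.values()) {
            c += p.count;
            s += p.sum;
        }
        return s / c;
    }
}
```

Storing sum and count rather than the average itself is what makes the aggregate mergeable across windows.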
>>>>> On Fri, Mar 25, 2016 at 8:47 AM, Inosh Goonewardena <[email protected]> wrote:
>>>>>
>>>>>> On Thu, Mar 24, 2016 at 9:56 PM, Sachith Withana <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Inosh,
>>>>>>>
>>>>>>> We wouldn't have to do that, IMO.
>>>>>>>
>>>>>>> We can persist the total aggregate value up to currentTimeWindow -
>>>>>>> WindowOffset, along with the previous time window's aggregation
>>>>>>> metadata as well (count and sum, in the average-aggregation case).
>>>>>>
>>>>>> Yep. That will do.
>>>>>>
>>>>>>> The previous total wouldn't be calculated again; it's the last two
>>>>>>> time windows (including the current one) that we need to
>>>>>>> recalculate and add to the previous total.
>>>>>>>
>>>>>>> It works almost the same way as the current incremental processing
>>>>>>> table, but keeps more metadata on the aggregation-related values.
>>>>>>>
>>>>>>> @Anjana WDYT?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sachith
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 7:07 AM, Inosh Goonewardena <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Sachith,
>>>>>>>>
>>>>>>>>> *WindowOffset*:
>>>>>>>>>
>>>>>>>>> Events that belong to an already-processed time window might
>>>>>>>>> arrive late. To account for that, we have added an optional
>>>>>>>>> parameter that allows processing the immediately previous time
>>>>>>>>> windows as well (it acts like a buffer).
>>>>>>>>> ex: If this is set to 1, apart from the to-be-processed data,
>>>>>>>>> data related to the previously processed time window will also
>>>>>>>>> be taken for processing.
>>>>>>>>
>>>>>>>> This means that if the window offset is set, already-processed
>>>>>>>> data will be processed again. How do the aggregate functions work
>>>>>>>> in this case? Actually, what is the plan for supporting aggregate
>>>>>>>> functions? Let's take the average function as an example.
>>>>>>>> Are we going to persist sum and count values per time window and
>>>>>>>> re-calculate the whole average based on the values of all time
>>>>>>>> windows? If so, I would guess we can update the previously
>>>>>>>> processed time windows' sum and count values.
>>>>>>>>
>>>>>>>> On Thu, Mar 24, 2016 at 3:50 AM, Sachith Withana <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Srinath,
>>>>>>>>>
>>>>>>>>> Please find the comments inline.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 24, 2016 at 11:39 AM, Srinath Perera <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sachith, Anjana,
>>>>>>>>>>
>>>>>>>>>> +1 for the backend model.
>>>>>>>>>>
>>>>>>>>>> Are we handling the case where, when the last run was done,
>>>>>>>>>> only 25 minutes of the hour's data had been processed?
>>>>>>>>>> Basically, the next run has to re-run the last hour and update
>>>>>>>>>> the value.
>>>>>>>>>
>>>>>>>>> Yes. It will recalculate for that hour and update the value.
>>>>>>>>>
>>>>>>>>>> When does the one-hour counting start? Is it from the moment
>>>>>>>>>> the server starts? That would be probabilistic when you
>>>>>>>>>> restart. I think we need to either start at a known place
>>>>>>>>>> (midnight) or let the user configure it.
>>>>>>>>>
>>>>>>>>> In the first run, all the available data is processed.
>>>>>>>>> After that, it takes the last processed event's timestamp and
>>>>>>>>> calculates its floor value (timestamp - timestamp % 3600); that
>>>>>>>>> is used as the start of the time windows.
>>>>>>>>>
>>>>>>>>>> I am a bit concerned about the syntax, though. This only works
>>>>>>>>>> for very specific types of queries (those that include an
>>>>>>>>>> aggregate and a group by). What happens if the user does this
>>>>>>>>>> with a different query? Can we give a clear error message?
>>>>>>>>>
>>>>>>>>> Currently the error messages are very generic. We will have to
>>>>>>>>> work on improving those messages.
>>>>>>>>> Thanks,
>>>>>>>>> Sachith
>>>>>>>>>
>>>>>>>>>> --Srinath
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We are adding incremental processing capability to Spark in
>>>>>>>>>>> DAS. As the first stage, we added time slicing to Spark
>>>>>>>>>>> execution.
>>>>>>>>>>>
>>>>>>>>>>> Here's a quick introduction to that.
>>>>>>>>>>>
>>>>>>>>>>> *Execution*:
>>>>>>>>>>>
>>>>>>>>>>> In the first run of the script, it will process all the data
>>>>>>>>>>> in the given table and store the last processed event
>>>>>>>>>>> timestamp. From the next run onwards, it will start processing
>>>>>>>>>>> from that stored timestamp.
>>>>>>>>>>>
>>>>>>>>>>> Until the query that contains the data-processing part
>>>>>>>>>>> completes, the last processed event timestamp is not
>>>>>>>>>>> overridden with the new value. This ensures that if the whole
>>>>>>>>>>> query fails, the data processing for the query does not have
>>>>>>>>>>> to be done again. It is enforced by adding a commit query
>>>>>>>>>>> after the main query; refer to the Syntax section for the
>>>>>>>>>>> example.
>>>>>>>>>>>
>>>>>>>>>>> *Syntax*:
>>>>>>>>>>>
>>>>>>>>>>> In the Spark script, incremental processing support has to be
>>>>>>>>>>> specified per table; this happens in the CREATE TEMPORARY
>>>>>>>>>>> TABLE line.
>>>>>>>>>>>
>>>>>>>>>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options
>>>>>>>>>>>     (tableName "test", incrementalProcessing "T1,3600");
>>>>>>>>>>>
>>>>>>>>>>> INSERT INTO T2 SELECT username, age GROUP BY age FROM T1;
>>>>>>>>>>>
>>>>>>>>>>> INC_TABLE_COMMIT T1;
>>>>>>>>>>>
>>>>>>>>>>> The last line is where it ensures the processing took place
>>>>>>>>>>> successfully, and it replaces the lastProcessed timestamp with
>>>>>>>>>>> the new one.
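The commit behaviour described above can be sketched as a two-phase marker update; the class below is only an illustration of the idea, not the actual DAS implementation:

```java
public class IncrementalMarker {

    private long lastProcessed; // committed marker, used as the next read start
    private long staged;        // candidate value produced by the current run

    // Each run reads data starting from the committed marker.
    public long readStart() {
        return lastProcessed;
    }

    // The main query records the newest processed event timestamp, but the
    // committed marker is left untouched until the commit step.
    public void stage(long newestEventTs) {
        staged = newestEventTs;
    }

    // INC_TABLE_COMMIT: only now is the marker advanced. If the main query
    // failed before this point, the next run re-reads from the old marker.
    public void commit() {
        lastProcessed = staged;
    }
}
```

Keeping the staged value separate from the committed one is what makes a failed script run safely repeatable.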
>>>>>>>>>>> *TimeWindow*:
>>>>>>>>>>>
>>>>>>>>>>> To do the incremental processing, the user has to provide the
>>>>>>>>>>> time window by which the data will be processed. In the above
>>>>>>>>>>> example, the data would be summarized in *1 hour* time
>>>>>>>>>>> windows.
>>>>>>>>>>>
>>>>>>>>>>> *WindowOffset*:
>>>>>>>>>>>
>>>>>>>>>>> Events that belong to an already-processed time window might
>>>>>>>>>>> arrive late. To account for that, we have added an optional
>>>>>>>>>>> parameter that allows processing the immediately previous
>>>>>>>>>>> time windows as well (it acts like a buffer).
>>>>>>>>>>> ex: If this is set to 1, apart from the to-be-processed data,
>>>>>>>>>>> data related to the previously processed time window will
>>>>>>>>>>> also be taken for processing.
>>>>>>>>>>>
>>>>>>>>>>> *Limitations*:
>>>>>>>>>>>
>>>>>>>>>>> Currently, multiple time windows cannot be specified per
>>>>>>>>>>> temporary table in the same script. That would have to be
>>>>>>>>>>> done using different temporary tables.
>>>>>>>>>>>
>>>>>>>>>>> *Future Improvements*:
>>>>>>>>>>> - Add aggregation function support for incremental processing
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sachith
>>>>>>>>>>> --
>>>>>>>>>>> Sachith Withana
>>>>>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>>>>>> E-mail: sachith AT wso2.com
>>>>>>>>>>> M: +94715518127
>>>>>>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> ============================
>>>>>>>>>> Srinath Perera, Ph.D.
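Combining the local-time-zone window start with WindowOffset, the next run's read start can be sketched like this for a DAY window. This is illustrative code under the assumption that the offset steps back whole windows; the real implementation generalises over all the supported units:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class IncrementalStart {

    // Start of the day window containing ts, in the given time zone,
    // stepped back by `windowOffset` whole days to re-read late arrivals.
    static long dayStart(long ts, int windowOffset, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTimeInMillis(ts);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        cal.add(Calendar.DAY_OF_MONTH, -windowOffset);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        TimeZone ist = TimeZone.getTimeZone("Asia/Kolkata");
        long last = 1461111538669L; // last processed event, 2016/04/20 05:48 AM IST

        // WindowOffset 0: read from 2016/04/20 12:00:00 AM IST
        System.out.println(dayStart(last, 0, ist)); // 1461090600000

        // WindowOffset 1: also re-read the previous day's window
        System.out.println(dayStart(last, 1, ist)); // 1461004200000
    }
}
```

Using Calendar.add rather than subtracting a fixed number of millis keeps the computation correct in zones with daylight-saving transitions.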
>>>>>>>>>> http://people.apache.org/~hemapani/
>>>>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>>
>>>>>>>> Inosh Goonewardena
>>>>>>>> Associate Technical Lead - WSO2 Inc.
>>>>>>>> Mobile: +94779966317
>>>>>
>>>>> --
>>>>> W.G. Gihan Anuruddha
>>>>> Senior Software Engineer | WSO2, Inc.
>>>>> M: +94772272595
>>>
>>> --
>>> *V. Mohanadarshan*
>>> *Associate Tech Lead,*
>>> *Data Technologies Team,*
>>> *WSO2, Inc.
>>> http://wso2.com*
>>> *lean.enterprise.middleware.*
>>>
>>> email: [email protected]
>>> phone: (+94) 771117673

--
============================
Blog: http://srinathsview.blogspot.com
twitter: @srinath_perera
Site: http://home.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
