I think this is OK for this release. However, we need to handle this cleanly and document clearly how it works for the next release. I think this is more of a problem for CEP. (Adding Suho.)
1. Can we add a Jira so we can handle this in a future release?
2. Can we update the documents to describe the current behaviour?

--Srinath

On Wed, Jun 15, 2016 at 9:03 PM, Inosh Goonewardena <[email protected]> wrote:
>
> On Wed, Jun 15, 2016 at 4:20 PM, Inosh Goonewardena <[email protected]> wrote:
>
>> Hi Srinath,
>>
>>> A DAS deployment might end up collecting data from multiple time
>>> zones. Is that supported? How is it handled? Do end users have to
>>> issue queries in UTC?
>>
>> We receive epoch time in the timestamp fields of incoming events. Based
>> on that timestamp we extract time attributes such as year, month, day,
>> hour, etc., and do per-<time unit> aggregations (such as per
>> minute/hour/day counts). At the moment we extract those time attributes
>> based on the local time zone the DAS server is running in. We started a
>> discussion [1] on the possibility of using UTC time, but decided
>> against it due to the discrepancies that can occur in window
>> boundaries.
>>
>> For example, if the DAS server is running in the IST time zone, some
>> example time window boundaries are as follows:
>> - day window [1461090600000 - 1461177000000] (2016/04/20 12:00:00 AM
>>   IST - 2016/04/21 12:00:00 AM IST)
>> - hour window [1461108600000 - 1461112200000] (2016/04/20 05:00:00 AM
>>   IST - 2016/04/20 06:00:00 AM IST)
>>
>> So with the above approach, the time zone of the data publisher we
>> collect data from is not considered, and windowing is done based on the
>> local time zone.
>>
>> Actually, if the data publisher publishes data from a different time
>> zone and the user wants to see window-based aggregates (in charts) in
>> that time zone, there will still be an issue, because the DAS window
>> boundaries differ from what the user expects.
>
> However, this can be avoided by doing windowing based on the time zone
> the data is collected from.
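The local-time-zone windowing described above can be illustrated with a small java.util.Calendar sketch. The timestamps are the ones from the example above; the class and method names are only for illustration, not the actual DAS code:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class WindowBoundary {

    // Floor an epoch-millis timestamp to the start of its day in the given zone.
    static long dayWindowStart(long ts, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTimeInMillis(ts);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        long ts = 1461111538669L; // 2016/04/20 05:48:58 AM IST

        // Day window start when the server runs in IST: 2016/04/20 12:00:00 AM IST
        System.out.println(dayWindowStart(ts, TimeZone.getTimeZone("Asia/Kolkata"))); // 1461090600000

        // The same event floored in UTC lands 5.5 hours later
        System.out.println(dayWindowStart(ts, TimeZone.getTimeZone("UTC")));          // 1461110400000
    }
}
```

The 5.5-hour gap between the two results is exactly the window-boundary discrepancy the thread is discussing.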
> Basically, what is explained above is the default behavior when the
> inbuilt temporal functions are used (e.g. the time:extract() Siddhi
> extension [1] and DateTimeUDF [2] in Spark). But it is possible to
> extend those extensions and do windowing based on the time zone the
> user is interested in. However, in order to support incremental data
> processing in such cases, we will need to take the time zone as an
> input parameter in the incremental query and set the incremental
> processing window start time based on that.
>
> [1] https://github.com/wso2/siddhi/blob/master/modules/siddhi-extensions/time/src/main/java/org/wso2/siddhi/extension/time/ExtractAttributesFunctionExtension.java
> [2] https://github.com/wso2/shared-analytics/blob/master/components/spark-udf/org.wso2.carbon.analytics.shared.spark.common.udf/src/main/java/org/wso2/carbon/analytics/shared/common/udf/DateTimeUDF.java
>
>> [1] [Analytics] Using UTC across all temporal analytics functions
>>
>>> Thoughts please.
>>>
>>> --Srinath
>>
>> On Wed, Jun 15, 2016 at 2:11 AM, Mohanadarshan Vivekanandalingam <[email protected]> wrote:
>>
>>> Hi Inosh,
>>>
>>> Thanks for the fix. We have incorporated this into analytics-is.
>>>
>>> Regards,
>>> Mohan
>>>
>>> On Tue, Jun 14, 2016 at 7:33 PM, Inosh Goonewardena <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We made a modification to the incremental data processing syntax and
>>>> implementation in order to avoid inconsistencies that occur with
>>>> time windows.
>>>>
>>>> For example, with the previous approach, if the timeWindow was
>>>> specified as 3600 (or 86400), the incremental read time was set to
>>>> the hour-start (or day-start) time in UTC. This is due to the fact
>>>> that the incremental start time is calculated by performing a modulo
>>>> operation on a timestamp. In most scenarios, analytics is
>>>> implemented on a per-<time unit> basis, i.e., we maintain summary
>>>> tables per minute, per hour, per day, per month.
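The modulo-based floor of the previous approach can be sketched as follows. Because epoch millis count from 1970-01-01 00:00 UTC, this floor always lands on a UTC boundary; the class name is illustrative and the timestamps are the ones used in the thread:

```java
public class ModuloFloor {

    // Previous approach: floor the timestamp with a modulo on the window
    // length in millis. Epoch millis are measured from a UTC origin, so
    // the result is always a UTC hour/day boundary.
    static long floorTo(long ts, long windowMillis) {
        return ts - ts % windowMillis;
    }

    public static void main(String[] args) {
        long ts = 1461111538669L; // 2016/04/20 00:18:58 AM UTC (05:48:58 AM IST)

        long dayMillis = 86400000L;
        System.out.println(floorTo(ts, dayMillis)); // 1461110400000 = 2016/04/20 12:00 AM UTC

        // The IST summary tables use 1461090600000 (2016/04/20 12:00 AM IST)
        // as the day start, 5.5 hours earlier, so the two windowings disagree.
    }
}
```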
>>>> In these summary tables, we do windowing based on the local time
>>>> zone of the server DAS runs on [1]. Since the incremental time
>>>> window started at a UTC hour or day boundary, inconsistencies
>>>> occurred in the aggregate values of time windows.
>>>>
>>>> With the newly introduced changes, the incremental window start time
>>>> is set based on the local time zone to avoid the aforementioned
>>>> inconsistencies. The syntax is as follows:
>>>>
>>>> CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options (tableName "test",
>>>>     incrementalProcessing "T1, WindowUnit, WindowOffset");
>>>>
>>>> The WindowUnit concept is similar to the previous TimeWindow concept,
>>>> but here the window's time unit needs to be provided instead of a
>>>> time in millis. The supported units are SECOND, MINUTE, HOUR, DAY,
>>>> MONTH and YEAR.
>>>>
>>>> For example, let's say the server runs in IST time and the last
>>>> processed event time is 1461111538669 (2016/04/20 05:48:58 AM IST).
>>>> If WindowUnit is set to DAY and WindowOffset is set to 0 in the
>>>> incremental table, the next script execution will read data starting
>>>> from 1461090600000 (2016/04/20 12:00:00 AM IST).
>>>>
>>>> [1] [Analytics] Using UTC across all temporal analytics functions
>>>>
>>>> On Fri, Mar 25, 2016 at 10:59 AM, Gihan Anuruddha <[email protected]> wrote:
>>>>
>>>>> Actually, currently in ESB-analytics we are also using a similar
>>>>> mechanism. There we keep five tables, one for each time unit
>>>>> (second, minute, hour, day, month), and we have a windowOffset
>>>>> corresponding to each time unit, e.g. minute - 60, hour - 3600,
>>>>> etc. We only store min, max, sum and count values. So at a given
>>>>> time, if we need the average, we divide the sum by the count.
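The min/max/sum/count scheme described above composes cleanly: each summary row keeps partial values from which averages can be derived, and re-processing a window simply replaces that window's partials. A minimal sketch under that assumption (class and method names are illustrative, not the actual ESB-analytics code):

```java
import java.util.HashMap;
import java.util.Map;

public class SummaryStore {

    // Partial aggregate state kept per time window.
    static final class Partial {
        long count;
        double sum;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void add(double v) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    // windowStart (epoch millis) -> partial state. Re-running a window
    // overwrites its entry, so late events are absorbed correctly.
    final Map<Long, Partial> windows = new HashMap<>();

    void observe(long windowStart, double value) {
        windows.computeIfAbsent(windowStart, k -> new Partial()).add(value);
    }

    // Average across all windows, derived from sums and counts only.
    double overallAverage() {
        long c = 0;
        double s = 0;
        for (Partial p : windows.values()) {
            c += p.count;
            s += p.sum;
        }
        return s / c;
    }
}
```

Storing sum and count rather than the average itself is what makes the aggregate mergeable across windows.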
>>>>> On Fri, Mar 25, 2016 at 8:47 AM, Inosh Goonewardena <[email protected]> wrote:
>>>>>
>>>>>> On Thu, Mar 24, 2016 at 9:56 PM, Sachith Withana <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Inosh,
>>>>>>>
>>>>>>> We wouldn't have to do that, IMO.
>>>>>>>
>>>>>>> We can persist the total aggregate value up to currentTimeWindow -
>>>>>>> WindowOffset, along with the previous time window's aggregation
>>>>>>> metadata as well (count and sum, in the average-aggregation case).
>>>>>>
>>>>>> Yep. That will do.
>>>>>>
>>>>>>> The previous total wouldn't be calculated again; it's the last two
>>>>>>> time windows (including the current one) that we need to
>>>>>>> recalculate and add to the previous total.
>>>>>>>
>>>>>>> It works almost the same way as the current incremental processing
>>>>>>> table, but keeps more metadata on the aggregation-related values.
>>>>>>>
>>>>>>> @Anjana WDYT?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sachith
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 7:07 AM, Inosh Goonewardena <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Sachith,
>>>>>>>>
>>>>>>>>> *WindowOffset*:
>>>>>>>>>
>>>>>>>>> Events that belong to an already-processed time window might
>>>>>>>>> arrive late. To account for that, we have added an optional
>>>>>>>>> parameter that allows processing the immediately previous time
>>>>>>>>> windows as well (it acts like a buffer).
>>>>>>>>> ex: If this is set to 1, apart from the to-be-processed data,
>>>>>>>>> data related to the previously processed time window will also
>>>>>>>>> be taken for processing.
>>>>>>>>
>>>>>>>> This means that if the window offset is set, already-processed
>>>>>>>> data will be processed again. How do the aggregate functions work
>>>>>>>> in this case? Actually, what is the plan for supporting aggregate
>>>>>>>> functions? Let's take the average function as an example.
>>>>>>>> Are we going to persist sum and count values per time window and
>>>>>>>> re-calculate the whole average based on the values of all time
>>>>>>>> windows? If so, I would guess we can update the previously
>>>>>>>> processed time windows' sum and count values.
>>>>>>>>
>>>>>>>> On Thu, Mar 24, 2016 at 3:50 AM, Sachith Withana <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Srinath,
>>>>>>>>>
>>>>>>>>> Please find the comments inline.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 24, 2016 at 11:39 AM, Srinath Perera <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sachith, Anjana,
>>>>>>>>>>
>>>>>>>>>> +1 for the backend model.
>>>>>>>>>>
>>>>>>>>>> Are we handling the case where, when the last run was done,
>>>>>>>>>> only 25 minutes of the hour's data had been processed?
>>>>>>>>>> Basically, the next run has to re-run the last hour and update
>>>>>>>>>> the value.
>>>>>>>>>
>>>>>>>>> Yes. It will recalculate for that hour and update the value.
>>>>>>>>>
>>>>>>>>>> When does the one-hour counting start? Is it from the moment
>>>>>>>>>> the server starts? That would be probabilistic when you
>>>>>>>>>> restart. I think we need to either start at a known place
>>>>>>>>>> (midnight) or let the user configure it.
>>>>>>>>>
>>>>>>>>> In the first run, all the available data is processed.
>>>>>>>>> After that, it takes the last processed event's timestamp and
>>>>>>>>> calculates its floor value (timestamp - timestamp % 3600); that
>>>>>>>>> is used as the start of the time windows.
>>>>>>>>>
>>>>>>>>>> I am a bit concerned about the syntax, though. This only works
>>>>>>>>>> for very specific types of queries (those that include an
>>>>>>>>>> aggregate and a group by). What happens if the user does this
>>>>>>>>>> with a different query? Can we give a clear error message?
>>>>>>>>>
>>>>>>>>> Currently the error messages are very generic. We will have to
>>>>>>>>> work on improving those messages.
>>>>>>>>> Thanks,
>>>>>>>>> Sachith
>>>>>>>>>
>>>>>>>>>> --Srinath
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We are adding incremental processing capability to Spark in
>>>>>>>>>>> DAS. As the first stage, we added time slicing to Spark
>>>>>>>>>>> execution.
>>>>>>>>>>>
>>>>>>>>>>> Here's a quick introduction to that.
>>>>>>>>>>>
>>>>>>>>>>> *Execution*:
>>>>>>>>>>>
>>>>>>>>>>> In the first run of the script, it will process all the data
>>>>>>>>>>> in the given table and store the last processed event
>>>>>>>>>>> timestamp. From the next run onwards, it will start processing
>>>>>>>>>>> from that stored timestamp.
>>>>>>>>>>>
>>>>>>>>>>> Until the query that contains the data-processing part
>>>>>>>>>>> completes, the last processed event timestamp is not
>>>>>>>>>>> overridden with the new value. This ensures that if the whole
>>>>>>>>>>> query fails, the data processing for the query does not have
>>>>>>>>>>> to be done again. It is enforced by adding a commit query
>>>>>>>>>>> after the main query; refer to the Syntax section for the
>>>>>>>>>>> example.
>>>>>>>>>>>
>>>>>>>>>>> *Syntax*:
>>>>>>>>>>>
>>>>>>>>>>> In the Spark script, incremental processing support has to be
>>>>>>>>>>> specified per table; this happens in the CREATE TEMPORARY
>>>>>>>>>>> TABLE line.
>>>>>>>>>>>
>>>>>>>>>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options
>>>>>>>>>>>     (tableName "test", incrementalProcessing "T1,3600");
>>>>>>>>>>>
>>>>>>>>>>> INSERT INTO T2 SELECT username, age GROUP BY age FROM T1;
>>>>>>>>>>>
>>>>>>>>>>> INC_TABLE_COMMIT T1;
>>>>>>>>>>>
>>>>>>>>>>> The last line is where it ensures the processing took place
>>>>>>>>>>> successfully, and it replaces the lastProcessed timestamp with
>>>>>>>>>>> the new one.
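The commit behaviour described above can be sketched as a two-phase marker update; the class below is only an illustration of the idea, not the actual DAS implementation:

```java
public class IncrementalMarker {

    private long lastProcessed; // committed marker, used as the next read start
    private long staged;        // candidate value produced by the current run

    // Each run reads data starting from the committed marker.
    public long readStart() {
        return lastProcessed;
    }

    // The main query records the newest processed event timestamp, but the
    // committed marker is left untouched until the commit step.
    public void stage(long newestEventTs) {
        staged = newestEventTs;
    }

    // INC_TABLE_COMMIT: only now is the marker advanced. If the main query
    // failed before this point, the next run re-reads from the old marker.
    public void commit() {
        lastProcessed = staged;
    }
}
```

Keeping the staged value separate from the committed one is what makes a failed script run safely repeatable.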
>>>>>>>>>>> *TimeWindow*:
>>>>>>>>>>>
>>>>>>>>>>> To do the incremental processing, the user has to provide the
>>>>>>>>>>> time window by which the data will be processed. In the above
>>>>>>>>>>> example, the data would be summarized in *1 hour* time
>>>>>>>>>>> windows.
>>>>>>>>>>>
>>>>>>>>>>> *WindowOffset*:
>>>>>>>>>>>
>>>>>>>>>>> Events that belong to an already-processed time window might
>>>>>>>>>>> arrive late. To account for that, we have added an optional
>>>>>>>>>>> parameter that allows processing the immediately previous
>>>>>>>>>>> time windows as well (it acts like a buffer).
>>>>>>>>>>> ex: If this is set to 1, apart from the to-be-processed data,
>>>>>>>>>>> data related to the previously processed time window will
>>>>>>>>>>> also be taken for processing.
>>>>>>>>>>>
>>>>>>>>>>> *Limitations*:
>>>>>>>>>>>
>>>>>>>>>>> Currently, multiple time windows cannot be specified per
>>>>>>>>>>> temporary table in the same script. That would have to be
>>>>>>>>>>> done using different temporary tables.
>>>>>>>>>>>
>>>>>>>>>>> *Future Improvements*:
>>>>>>>>>>> - Add aggregation function support for incremental processing
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sachith
>>>>>>>>>>> --
>>>>>>>>>>> Sachith Withana
>>>>>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>>>>>> E-mail: sachith AT wso2.com
>>>>>>>>>>> M: +94715518127
>>>>>>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> ============================
>>>>>>>>>> Srinath Perera, Ph.D.
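Combining the local-time-zone window start with WindowOffset, the next run's read start can be sketched like this for a DAY window. This is illustrative code under the assumption that the offset steps back whole windows; the real implementation generalises over all the supported units:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class IncrementalStart {

    // Start of the day window containing ts, in the given time zone,
    // stepped back by `windowOffset` whole days to re-read late arrivals.
    static long dayStart(long ts, int windowOffset, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTimeInMillis(ts);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        cal.add(Calendar.DAY_OF_MONTH, -windowOffset);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        TimeZone ist = TimeZone.getTimeZone("Asia/Kolkata");
        long last = 1461111538669L; // last processed event, 2016/04/20 05:48 AM IST

        // WindowOffset 0: read from 2016/04/20 12:00:00 AM IST
        System.out.println(dayStart(last, 0, ist)); // 1461090600000

        // WindowOffset 1: also re-read the previous day's window
        System.out.println(dayStart(last, 1, ist)); // 1461004200000
    }
}
```

Using Calendar.add rather than subtracting a fixed number of millis keeps the computation correct in zones with daylight-saving transitions.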
>>>>>>>>>> http://people.apache.org/~hemapani/
>>>>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>>
>>>>>>>> Inosh Goonewardena
>>>>>>>> Associate Technical Lead - WSO2 Inc.
>>>>>>>> Mobile: +94779966317
>>>>>
>>>>> --
>>>>> W.G. Gihan Anuruddha
>>>>> Senior Software Engineer | WSO2, Inc.
>>>>> M: +94772272595
>>>
>>> --
>>> *V. Mohanadarshan*
>>> *Associate Tech Lead,*
>>> *Data Technologies Team,*
>>> *WSO2, Inc.
>>> http://wso2.com*
>>> *lean.enterprise.middleware.*
>>>
>>> email: [email protected]
>>> phone: (+94) 771117673

--
============================
Blog: http://srinathsview.blogspot.com
twitter: @srinath_perera
Site: http://home.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
