Hi Sachith,
*WindowOffset*:

> Events might arrive late that belong to a previously processed time
> window. To account for that, we have added an optional parameter that
> allows immediately previous time windows to be processed as well (it
> acts like a buffer).
> ex: If this is set to 1, apart from the to-be-processed data, data
> related to the previously processed time window will also be taken for
> processing.

This means that if the window offset is set, already-processed data will be
processed again. How do aggregate functions work in this case? Actually,
what is the plan for supporting aggregate functions?

Let's take the average function as an example. Are we going to persist sum
and count values per time window and recalculate the whole average from the
values of all time windows? If so, I would guess we can update the sum and
count values of previously processed time windows.

On Thu, Mar 24, 2016 at 3:50 AM, Sachith Withana <[email protected]> wrote:

> Hi Srinath,
>
> Please find the comments inline.
>
> On Thu, Mar 24, 2016 at 11:39 AM, Srinath Perera <[email protected]> wrote:
>
>> Hi Sachith, Anjana,
>>
>> +1 for the backend model.
>>
>> Are we handling the case where, when the last run was done, only 25
>> minutes of data had been processed? Basically, the next run has to
>> re-run that last hour and update the value.
>
> Yes. It will recalculate for that hour and update the value.
>
>> When does the one-hour counting start? Is it from the moment the server
>> starts? That will be probabilistic when you restart. I think we need to
>> either start at a known point (midnight) or let the user configure it.
>
> In the first run, all the available data are processed.
> After that, it takes the floor of the last processed event's timestamp
> (timestamp - timestamp % 3600), and that value is used as the start of
> the time windows.
>
>> I am a bit concerned about the syntax, though. This only works for a
>> very specific type of query (one that includes an aggregate and a
>> GROUP BY).
>> What happens if the user runs a different kind of query? Can we give a
>> clear error message?
>
> Currently the error messages are very generic. We will have to work on
> improving those messages.
>
> Thanks,
> Sachith
>
>> --Srinath
>>
>> On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> We are adding incremental processing capability to Spark in DAS.
>>> As the first stage, we added time slicing to Spark execution.
>>>
>>> Here's a quick introduction to that.
>>>
>>> *Execution*:
>>>
>>> In the first run of the script, it will process all the data in the
>>> given table and store the last processed event's timestamp.
>>> From the next run onwards, it will start processing from that stored
>>> timestamp.
>>>
>>> Until the query that contains the data-processing part completes, the
>>> last processed event timestamp is not overridden with the new value.
>>> This ensures that the data processing for the query does not have to
>>> be done again if the whole query fails.
>>> It is enforced by adding a commit query after the main query.
>>> Refer to the Syntax section for the example.
>>>
>>> *Syntax*:
>>>
>>> In the Spark script, incremental processing support has to be
>>> specified per table; this happens in the CREATE TEMPORARY TABLE line.
>>>
>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options (tableName
>>> "test", *incrementalProcessing "T1,3600"*);
>>>
>>> INSERT INTO T2 SELECT username, age FROM T1 GROUP BY age;
>>>
>>> INC_TABLE_COMMIT T1;
>>>
>>> The last line is what ensures the processing took place successfully
>>> and replaces the lastProcessed timestamp with the new one.
>>>
>>> *TimeWindow*:
>>>
>>> To do the incremental processing, the user has to provide the time
>>> window per which the data will be processed.
>>> In the above example, the data would be summarized in *1 hour* time
>>> windows.
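The time-slicing logic described in the quoted thread (floor of the last processed event's timestamp used as the window start, with a partially processed window re-run in full) could be sketched as follows. This is an illustrative sketch only; `window_start` and `records_to_process` are hypothetical names, not DAS internals:

```python
# Illustrative sketch of the time-slicing behaviour described above; not DAS code.

def window_start(timestamp, window_size):
    """Floor a timestamp to the start of its time window
    (timestamp - timestamp % window_size; window_size=3600 for 1-hour windows)."""
    return timestamp - timestamp % window_size

def records_to_process(records, last_processed, window_size):
    """Select the records the next incremental run should see: everything
    from the start of the window containing the last processed event, so a
    partially processed window (the '25 minutes' case in the thread) is
    re-run in full and its summarized value updated."""
    start = window_start(last_processed, window_size)
    return [r for r in records if r["ts"] >= start]
```

For example, if the last processed event was at t=5400 with one-hour windows, the next run starts from t=3600 and re-processes that partially completed hour.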
>>> *WindowOffset*:
>>>
>>> Events might arrive late that belong to a previously processed time
>>> window. To account for that, we have added an optional parameter that
>>> allows immediately previous time windows to be processed as well (it
>>> acts like a buffer).
>>> ex: If this is set to 1, apart from the to-be-processed data, data
>>> related to the previously processed time window will also be taken
>>> for processing.
>>>
>>> *Limitations*:
>>>
>>> Currently, multiple time windows cannot be specified per temporary
>>> table in the same script.
>>> That would have to be done using different temporary tables.
>>>
>>> *Future Improvements*:
>>> - Add aggregation function support for incremental processing
>>>
>>> Thanks,
>>> Sachith
>>> --
>>> Sachith Withana
>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>> E-mail: sachith AT wso2.com
>>> M: +94715518127
>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>
>> --
>> ============================
>> Srinath Perera, Ph.D.
>> http://people.apache.org/~hemapani/
>> http://srinathsview.blogspot.com/
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

--
Thanks & Regards,

Inosh Goonewardena
Associate Technical Lead - WSO2 Inc.
Mobile: +94779966317
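The WindowOffset buffer quoted above (offset 1 re-processing the immediately previous window along with the current one) could be sketched like this; the function and parameter names are hypothetical, chosen only to mirror the description:

```python
# Illustrative sketch of the WindowOffset buffer; names are hypothetical.

def window_start(timestamp, window_size):
    """Floor a timestamp to the start of its time window."""
    return timestamp - timestamp % window_size

def processing_start(last_processed, window_size, window_offset=0):
    """Start of the data range for the next incremental run. window_offset
    extends the range back by that many already-processed windows, acting
    as a buffer for late-arriving events (offset 1 means the immediately
    previous window is re-processed as well)."""
    return window_start(last_processed, window_size) - window_offset * window_size
```

With one-hour windows and a last processed event at t=5400, an offset of 0 starts the run at t=3600, while an offset of 1 pulls it back one full window to t=0.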
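The persisted sum/count approach suggested for averages at the top of the thread could look roughly like the sketch below: keep a (sum, count) pair per time window so that a late event can update an already-processed window, and recompute averages from the stored pairs. `WindowedAverage` is an illustrative class, not part of DAS:

```python
from collections import defaultdict

class WindowedAverage:
    """Sketch of the per-window sum/count idea from the thread: persist a
    (sum, count) pair per time window so late events can update an
    already-processed window and averages stay cheap to recompute.
    Illustrative only; not DAS internals."""

    def __init__(self, window_size):
        self.window_size = window_size
        # window start timestamp -> [running sum, event count]
        self.state = defaultdict(lambda: [0.0, 0])

    def add(self, timestamp, value):
        """Fold one event into its window's sum and count
        (the same path handles on-time and late events)."""
        w = timestamp - timestamp % self.window_size
        self.state[w][0] += value
        self.state[w][1] += 1

    def window_average(self, timestamp):
        """Average of the window containing `timestamp`."""
        total, count = self.state[timestamp - timestamp % self.window_size]
        return total / count if count else None

    def overall_average(self):
        """Whole-table average recomputed from the per-window pairs,
        without re-reading the raw events."""
        total = sum(s for s, _ in self.state.values())
        count = sum(c for _, c in self.state.values())
        return total / count if count else None
```

A late event simply lands in its original window's pair, so both that window's average and the overall average come out right after the update.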
