Hi Sachith, Anjana, +1 for the backend model.
Are we handling the case when last run was done, 25 minutes of data is processed. Basically, next run has to re-run last hour and update the value. When does one hour counting starts? is it from the moment server starts? That will be probabilistic when you restart. I think we need to either start with know place ( midnight) or let user configure it. I am bit concern about the syntax though. This only works for very specific type of queries ( that includes aggregate and a group by). What happen if user do this with a different query? Can we give clear error message? --Srinath On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]> wrote: > Hi all, > > We are adding incremental processing capability to Spark in DAS. > As the first stage, we added time slicing to Spark execution. > > Here's a quick introduction into that. > > *Execution*: > > In the first run of the script, it will process all the data in the given > table and store the last processed event timestamp. > Then from the next run onwards it would start processing starting from > that stored timestamp. > > Until the query that contains the data processing part, completes, last > processed event timestamp would not be overridden with the new value. > This is to ensure that the data processing for the query wouldn't have to > be done again, if the whole query fails. > This is ensured by adding a commit query after the main query. > Refer to the Syntax section for the example. > > *Syntax*: > > In the spark script, incremental processing support has to be specified > per table, this would happen in the create temporary table line. > > ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options (tableName > "test", > *incrementalProcessing "T1,3600");* > > INSERT INTO T2 SELECT username, age GROUP BY age FROM T1; > > INC_TABLE_COMMIT T1; > > The last line is where it ensures the processing took place successfully > and replaces the lastProcessed timestamp with the new one. > > *TimeWindow*: > > To do the incremental processing, the user has to provide the time window > per which the data would be processed. > In the above example. the data would be summarized in *1 hour *time > windows. > > *WindowOffset*: > > Events might arrive late that would belong to a previous processed time > window. To account to that, we have added an optional parameter that would > allow to > process immediately previous time windows as well ( acts like a buffer). > ex: If this is set to 1, apart from the to-be-processed data, data related > to the previously processed time window will also be taken for processing. > > > *Limitations*: > > Currently, multiple time windows cannot be specified per temporary table > in the same script. > It would have to be done using different temporary tables. > > > > *Future Improvements:* > - Add aggregation function support for incremental processing > > > Thanks, > Sachith > -- > Sachith Withana > Software Engineer; WSO2 Inc.; http://wso2.com > E-mail: sachith AT wso2.com > M: +94715518127 > Linked-In: <http://goog_416592669> > https://lk.linkedin.com/in/sachithwithana > -- ============================ Srinath Perera, Ph.D. http://people.apache.org/~hemapani/ http://srinathsview.blogspot.com/
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
