Hi Sachith, Anjana,

+1 for the backend model.

Are we handling the case when last run was done, 25 minutes of data is
processed. Basically, next run has to re-run last hour and update the value.

When does one hour counting starts? is it from the moment server starts?
That will be probabilistic when you restart. I think we need to either
start with know place ( midnight) or let user configure it.

I am bit concern about the syntax though. This only works for very specific
type of queries ( that includes aggregate and a group by). What happen if
user do this with a different query? Can we give clear error message?

--Srinath

On Mon, Mar 21, 2016 at 5:15 PM, Sachith Withana <[email protected]> wrote:

> Hi all,
>
> We are adding incremental processing capability to Spark in DAS.
> As the first stage, we added time slicing to Spark execution.
>
> Here's a quick introduction into that.
>
> *Execution*:
>
> In the first run of the script, it will process all the data in the given
> table and store the last processed event timestamp.
> Then from the next run onwards it would start processing starting from
> that stored timestamp.
>
> Until the query that contains the data processing part, completes, last
> processed event timestamp would not be overridden with the new value.
> This is to ensure that the data processing for the query wouldn't have to
> be done again, if the whole query fails.
> This is ensured by adding a commit query after the main query.
> Refer to the Syntax section for the example.
>
> *Syntax*:
>
> In the spark script, incremental processing support has to be specified
> per table, this would happen in the create temporary table line.
>
> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options (tableName
> "test",
> *incrementalProcessing "T1,3600");*
>
> INSERT INTO T2 SELECT username, age GROUP BY age FROM T1;
>
> INC_TABLE_COMMIT T1;
>
> The last line is where it ensures the processing took place successfully
> and replaces the lastProcessed timestamp with the new one.
>
> *TimeWindow*:
>
> To do the incremental processing, the user has to provide the time window
> per which the data would be processed.
> In the above example. the data would be summarized in *1 hour *time
> windows.
>
> *WindowOffset*:
>
> Events might arrive late that would belong to a previous processed time
> window. To account to that, we have added an optional parameter that would
> allow to
> process immediately previous time windows as well ( acts like a buffer).
> ex: If this is set to 1, apart from the to-be-processed data, data related
> to the previously processed time window will also be taken for processing.
>
>
> *Limitations*:
>
> Currently, multiple time windows cannot be specified per temporary table
> in the same script.
> It would have to be done using different temporary tables.
>
>
>
> *Future Improvements:*
> - Add aggregation function support for incremental processing
>
>
> Thanks,
> Sachith
> --
> Sachith Withana
> Software Engineer; WSO2 Inc.; http://wso2.com
> E-mail: sachith AT wso2.com
> M: +94715518127
> Linked-In: <http://goog_416592669>
> https://lk.linkedin.com/in/sachithwithana
>



-- 
============================
Srinath Perera, Ph.D.
   http://people.apache.org/~hemapani/
   http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Reply via email to