Hi all,

First of all, +1 for the idea, since it would be a great addition on top of the current incremental implementation.
As Maninda mentioned, it's really tricky to handle this from the timestamp itself, since we'd have to deal with late-arriving events whose timestamps are smaller than those already processed. A better solution (IMO) would be to add another column to the tables from the DAL layer itself: a *row ID with auto increments*. Just as we do in the current incremental analytics, we can keep the *last processed row ID*, and in the next iteration we fetch the rows where rowID > lastProcessedRowID. This way the changes would be minimal: we add another column, and based on the user preference we keep either the last processed timestamp or the last processed row ID. WDYT?

Regards,
Sachith

On Wed, Jun 8, 2016 at 3:42 AM, Maninda Edirisooriya <[email protected]> wrote:

> Hi Gihan,
>
> Yes, I am referring to the full incremental analytics we are planning on.
> What I mean is that we will have to extend the Data Layer to add a
> metadata column to the table in order to add metadata fields. Then we can
> add fields like "processed_query1" to the table when an incremental query
> like "query1" is added.
>
> Actually, this implementation violates the independence of the Data Layer
> table from the analytics it is involved in, when metadata columns are
> added to the same table. So the other alternative is to add a separate
> data layer table for each incremental query to keep the processed state.
> (E.g. that table should have the columns of the primary key of the real
> table + the boolean field "processed".) But in this approach, when each
> record in the real table is being processed, it should check the
> processed flag of that record in the metadata table, which has an n^2
> complexity. WDYT?
> Thanks.
>
>
> *Maninda Edirisooriya*
> Senior Software Engineer
>
> *WSO2, Inc.* lean.enterprise.middleware.
> *Blog* : http://maninda.blogspot.com/
> *E-mail* : [email protected]
> *Skype* : @manindae
> *Twitter* : @maninda
>
> On Wed, Jun 8, 2016 at 1:42 PM, Gihan Anuruddha <[email protected]> wrote:
>
>> Hi Maninda,
>>
>> We have introduced some of the incremental data processing capabilities
>> with the upcoming 3.1.0 release. Please note that this doesn't support
>> fully functional data processing with data aggregation functionality.
>> Basically, what we have done is introduce a way to fetch data based on
>> time windows, to avoid iterating the same data set from the beginning
>> again and again. To avoid data losses, we have introduced a buffer time
>> period, and due to that, some events may be returned by select queries
>> more than once in consecutive analytics task executions. Because of
>> that, some aggregation operations like average can be wrong. We have a
>> plan to introduce fully functional incremental data processing support
>> in a future DAS release.
>>
>> Regards,
>> Gihan
>>
>> On Wed, Jun 8, 2016 at 11:53 AM, Maninda Edirisooriya <[email protected]>
>> wrote:
>>
>>> [Adding Architecture list]
>>>
>>> Hi all,
>>>
>>> The timestamp-based approach for incremental processing is problematic,
>>> as we have gone through long discussions on it and could not come to an
>>> acceptable solution. Instead, I think the following kind of approach
>>> would work.
>>>
>>> 1. For each incremental analytics script, a metadata column of type
>>> boolean, named "processed", is added to the analytics table with the
>>> value "false".
>>> 2. When an incremental script is executed on a data row, that
>>> particular row gets updated with *processed=true*.
>>> 3. The next time the script is executed, it can skip all the rows with
>>> *processed=true*.
>>>
>>> This will avoid the timestamp restriction and buffer time issues, and
>>> allow parallel execution on records.
>>> Thanks.
>>>
>>>
>>> *Maninda Edirisooriya*
>>> Senior Software Engineer
>>>
>>> *WSO2, Inc.* lean.enterprise.middleware.
>>>
>>> *Blog* : http://maninda.blogspot.com/
>>> *E-mail* : [email protected]
>>> *Skype* : @manindae
>>> *Twitter* : @maninda
>>>
>>> On Wed, Jun 8, 2016 at 11:22 AM, Gihan Anuruddha <[email protected]> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> To fulfill the above requirement, we can add a query as below and make
>>>> the necessary changes to the back-end.
>>>>
>>>> *create temporary table t5 using CarbonAnalytics options (tableName
>>>> "t3", schema "x INT, y INT", incrementalParams "t5, -1");*
>>>>
>>>> Basically, we are passing -1 for the buffer time. In the back-end, if
>>>> the buffer is -1, we only take the last processed event timestamp and
>>>> fetch the data.
>>>>
>>>> If we insert 3 records and commit when the buffer is -1, and then do
>>>> the select the next time without inserting any records, we get no
>>>> result, since no new record was inserted after the saved timestamp.
>>>>
>>>> So what do you think about this implementation?
>>>>
>>>> Regards,
>>>> Gihan
>>>>
>>>> --
>>>> W.G. Gihan Anuruddha
>>>> Senior Software Engineer | WSO2, Inc.
>>>> M: +94772272595
>>
>> --
>> W.G. Gihan Anuruddha
>> Senior Software Engineer | WSO2, Inc.
>> M: +94772272595

--
Sachith Withana
Software Engineer; WSO2 Inc.; http://wso2.com
E-mail: sachith AT wso2.com
M: +94715518127
Linked-In: https://lk.linkedin.com/in/sachithwithana
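[Editor's note] For readers following the thread, the row-ID proposal above can be illustrated with a minimal, self-contained sketch. This is not DAS code: the in-memory `table` and `checkpoint` structures, and names like `lastProcessedRowID`, are hypothetical stand-ins for a DAL table with an auto-increment column and a persisted checkpoint.

```python
# Minimal sketch (assumed structures, not the DAS implementation):
# incremental processing keyed on an auto-incrementing row ID
# rather than on an event timestamp.

table = []          # list of dicts: {"rowID": int, "payload": str}
checkpoint = {"lastProcessedRowID": 0}

def insert(payload):
    """Append a row; rowID is assigned monotonically, like an auto-increment column."""
    next_id = table[-1]["rowID"] + 1 if table else 1
    table.append({"rowID": next_id, "payload": payload})

def process_increment():
    """Fetch only rows with rowID > lastProcessedRowID, then advance the checkpoint."""
    last = checkpoint["lastProcessedRowID"]
    batch = [r for r in table if r["rowID"] > last]
    if batch:
        checkpoint["lastProcessedRowID"] = batch[-1]["rowID"]
    return batch

insert("e1"); insert("e2"); insert("e3")
first = process_increment()    # picks up all three rows
second = process_increment()   # nothing new, so the batch is empty
insert("e4")                   # a late-arriving event still gets a higher rowID
third = process_increment()    # picks up only e4
```

The point of the sketch: because the row ID is assigned at insertion time, a late-arriving event is still seen exactly once, with no buffer window needed.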
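[Editor's note] Gihan's remark that the buffer-time approach can make aggregations like average wrong can also be shown concretely. This is a toy illustration under assumed numbers, not DAS behaviour: the buffer window causes some events to be fetched twice, so a naive average over the fetched batches drifts from the true average.

```python
# Toy illustration of the buffer-time pitfall: re-fetching a buffer
# window means some events are seen twice across consecutive runs.

events = [(100, 10.0), (200, 20.0), (300, 30.0)]  # (timestamp, value)

def fetch(events, since):
    """Return values of events strictly newer than `since`."""
    return [v for (t, v) in events if t > since]

BUFFER = 150   # assumed buffer period in the same time units
last_ts = 0
seen = []

# First run: fetch everything after last_ts - BUFFER (i.e. all events).
seen += fetch(events, last_ts - BUFFER)
last_ts = 300

# Second run: the buffer makes us re-read events newer than 300 - 150 = 150,
# so the events at t=200 and t=300 are fetched a second time.
seen += fetch(events, last_ts - BUFFER)

true_avg = sum(v for _, v in events) / len(events)  # average over distinct events
naive_avg = sum(seen) / len(seen)                   # inflated by the duplicates
```

Here `seen` contains five values instead of three, so `naive_avg` disagrees with `true_avg`; this is exactly why duplicate-safe fetching (row IDs or a processed flag) matters for aggregations.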
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
