Re: [Architecture] [Update] Incremental Processing for BAM

Sinthuja Ragendran Mon, 10 Feb 2014 07:15:44 -0800

Hi Srinath,



On Mon, Feb 10, 2014 at 7:51 PM, Srinath Perera <[email protected]> wrote:

> Hi Sinthuja,
>
> Do we now support processing data for a given time period? e.g. last 24
> hours.
>

Yes we do support that. We can do this by following ways.

1) We can configure the start time and end time of data which needs to be
considered for the current hive query execution with following annotation.

*@Incremental(name="incrSample", tables="cassandraTable1",
fromTime="$NOW-86400000", toTime="$NOW")*

Here 24hours = 86400000 ms. Hence it will consider last 24hours data.


2) We can also schedule the hive script to run every 24hours and then we
can annotate the hive script with following annotation:

@Incremental(name="incrementalProcess", tables="cassandraReadTable",
bufferTime="0").

Hence the query only will consider the unprocessed last 24hours data.




> I am -1 on doing anything more on this for now (e.g. incr_count).
>

As far as we can confirm incr_avg() sample function is working, the
implementation of other functions also very similar. And also it will be
very useful for the users to use this functionality IMHO.

>
> IMHO, there are more high priority items in the roadmap, please chat with
> Anjana.
>

According to discussion I had previously with Anjana, first we have to
start with sample function (incr_avg()) and then implement that to other
functions. May be now the priorities have been changed and we can push this
for later releases. :)   Anjana, Please confirm.

Thanks,
Sinthuja.

>
> --Srinath
>
>
> On Mon, Feb 10, 2014 at 2:42 PM, Sinthuja Ragendran <[email protected]>wrote:
>
>> Hi all,
>>
>> I was working on providing the incremental processing support for BAM.
>> This feature was implemented in BAM 2.4.0 [1], but it wasn't well tested in
>> fully distributed mode during the BAM 2.4.0 release, hence it was marked as
>> experimental feature. You can find the implementation details of this
>> feature from [2].
>>
>> During the last week I was involved with testing this feature with 3 node
>> hadoop cluster and Cassandra cluster. And all use cases mentioned in [1]
>> was able to run in the external cluster without any issues.
>>
>> And also I have implemented incremental average operation (incr_avg())
>> which incrementally calculates the final value based on the last hive query
>> execution and the current execution. The intermediate necessary values for
>> incremental operation, are store in cassandra and used for the current hive
>> query execution. And once the current hive query execution is completed,
>> the intermediate results will be again replaced with the new values. I have
>> tested this incr_avg() operation also with fully distributed setup and it
>> works well. As the first step, I have only implemented the incr_avg() to
>> make sure whether the adopted approach will succeed on fully distributed
>> setup.
>>
>> The below is the hive script sample which uses the incr_avg() function:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesTable   (orderID STRING,
>> brandName STRING, userName STRING, quantity INT,     version STRING) STORED
>> BY   'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'   WITH
>> SERDEPROPERTIES (     "wso2.carbon.datasource.name
>> <http://wso2.carbon.datasource.name>" = "WSO2BAM_CASSANDRA_DATASOURCE",
>> "cassandra.cf.name <http://cassandra.cf.name>" =
>> "org_wso2_bam_phone_retail_store_kpi" ,   "cassandra.columns.mapping" =
>> ":key,payload_brand, payload_user, payload_quantity, Version" );
>> @Incremental(name="avgAnalysis", tables="PhoneSalesTable",
>> bufferTime="20")select brandName, count(DISTINCT orderID),
>> incr_avg(quantity, "average_quantity")  from PhoneSalesTable    where
>> version= "1.0.0" group by brandName;*
>>
>> The following are the to-do items on this feature:
>>
>>    - Do a load test with all cases/scenarios with distributed setup
>>    - Implement more commonly used hive functions as incremental
>>    functions(incr_count, incr_sum, etc).
>>    - Write a sample to explain how to use this feature.
>>
>>
>> [1] http://docs.wso2.org/pages/viewpage.action?pageId=32345660
>> [2]
>> http://wso2-oxygen-tank.10903.n7.nabble.com/Incremental-Data-Processing-for-BAM-td77582.html
>>
>> Thanks,
>> Sinthuja.
>>
>> --
>> *Sinthuja Rajendran*
>> Software Engineer <http://wso2.com/>
>> WSO2, Inc.:http://wso2.com
>>
>> Blog: http://sinthu-rajan.blogspot.com/
>> Mobile: +94774273955
>>
>>
>>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>   Director, Research, WSO2 Inc.
>   Visiting Faculty, University of Moratuwa
>   Member, Apache Software Foundation
>   Research Scientist, Lanka Software Foundation
>   Blog: http://srinathsview.blogspot.com/
>   Photos: http://www.flickr.com/photos/hemapani/
>    Phone: 0772360902
>



-- 
*Sinthuja Rajendran*
Software Engineer <http://wso2.com/>
WSO2, Inc.:http://wso2.com

Blog: http://sinthu-rajan.blogspot.com/
Mobile: +94774273955

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] [Update] Incremental Processing for BAM

Reply via email to