On Fri, Sep 6, 2013 at 7:31 AM, Srinath Perera <[email protected]> wrote:

> Do we know how long the archival job takes? e.g. how long to archive a GB
> or a million records?
>

With the current implementation, it takes 846 seconds to archive one
million records. These numbers might change with the latest approach.
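For reference, the operation being timed is essentially a timestamp-windowed copy-and-delete from the active store to the archive store. Below is a minimal, hypothetical sketch of that loop; plain dicts stand in for the Cassandra column families, and the `cutoff_ts` parameter and row layout are illustrative, not the actual implementation:

```python
# Sketch of the archival step: rows older than a cutoff timestamp are
# copied to the archive store, then deleted from the active store.
# Dicts stand in for Cassandra column families.

def archive_rows(active_cf, archive_cf, cutoff_ts):
    """Move every row whose 'timestamp' is older than cutoff_ts."""
    to_move = [key for key, row in active_cf.items()
               if row["timestamp"] < cutoff_ts]
    for key in to_move:
        archive_cf[key] = active_cf.pop(key)  # copy, then delete
    return len(to_move)

active = {
    "r1": {"timestamp": 100, "payload": "a"},
    "r2": {"timestamp": 200, "payload": "b"},
    "r3": {"timestamp": 300, "payload": "c"},
}
archive = {}

moved = archive_rows(active, archive, cutoff_ts=250)
print(moved)            # number of rows archived
print(sorted(archive))  # keys now in the archive store
```

In a real Hector- or Hive-based implementation the filtering would of course be pushed to the store (an indexed timestamp column or a Hive query) rather than scanned in application memory.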

Thanks,
Malith

>
>
> On Wed, Sep 4, 2013 at 3:46 PM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Dipesh,
>>
>> Thank you for the ideas. Actually, yeah, we can also support archiving a
>> general CF without filtering the records on fields like stream version and
>> so on. It should be straightforward functionality, without changing much
>> in the backend.
>>
>> As for using Hive, you have a point there. We actually considered it
>> earlier as well, but decided not to go with that approach, thinking it had
>> a limitation: we couldn't address the data when the column names are not
>> known in Cassandra. But on looking into it further now, we identified that
>> it can actually be done. So yes, we will now look again into using Hive to
>> do the processing. With that, we can also easily support archiving from/to
>> several data sources, such as RDBMS, Cassandra, and HDFS.
>>
>> As for the indexing concerns, we were going to use a custom index-based
>> approach. Now, most probably, if Hive is used, we will straight away use
>> the functionality provided by incremental processing, since it already
>> contains the indexing features for timestamps. With these features tied
>> in, hopefully it will be a solid implementation.
>>
>> Cheers,
>> Anjana.
>>
>> On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:
>>
>>> Hi Malith,
>>>
>>> The current (Hive-based) solution (and, it seems, the proposed solution)
>>> only handles Column Families (CFs) created/maintained by BAM (based on
>>> the stream definition). A couple of improvements would really help:
>>>  - Currently, the archiving configuration is per 'CF + stream-def
>>> version'. Is it possible to have just one archive configuration that
>>> takes care of a given CF irrespective of the stream-def version?
>>>  - The archiving feature should support any CF existing in a given
>>> Cassandra cluster. We are currently using Cassandra (instead of an RDBMS
>>> like MySQL) to store analyzed data. Of course, the configuration would
>>> need to have the name of the 'timestamp' column for each CF, based on
>>> which the data would be filtered for archiving.
>>>
>>> For a Hector-based implementation, I would imagine that 'non-secondary'
>>> indexing on the 'timestamp' column would be required to efficiently
>>> filter and archive the data. If you agree, how do you folks plan to
>>> handle this? If not required, how would the solution scale/perform well
>>> without indexing?
>>>
>>> Also, in addition to archiving data from Cassandra (ActiveStore) to
>>> Cassandra (ArchiveStore), shouldn't it support archiving to
>>> traditional-SAN-like storage options, HDFS, etc.?
>>> I think these other options could be easily/naturally supported by Hive
>>> itself, where the Hive result could be streamed as key-value pairs to
>>> these types of archive stores.
>>>
>>> Regards,
>>> Dipesh
>>>
>>>
>>> Malith Dhanushka wrote
>>> > Hi folks,
>>> >
>>> > We (the BAM team and Sumedha) had a discussion about the $Subject, and
>>> > the following are the suggested improvements for the Cassandra data
>>> > archival feature in BAM.
>>> >
>>> > - Remove Hive-script-based archiving and use the Hector API to directly
>>> > issue archive queries to Cassandra. (The current implementation is
>>> > based on Hive, where it generates a Hive script and the archiving
>>> > process uses map-reduce jobs to achieve the task; it has the limitation
>>> > of discarding custom key-value pairs in the column family.)
>>> >
>>> > - Use Task component for scheduling purposes
>>> >
>>> > - Archive data to external Cassandra ring
>>> >
>>> > - Major UI improvements
>>> >     - List the current archiving tasks
>>> >     - Edit, Remove and Schedule archiving tasks
>>> >     - Add new archiving task
>>> >
>>> > If there are any additional requirements, please raise them.
>>> >
>>> > Thanks,
>>> > Malith
>>> > --
>>> > Malith Dhanushka
>>> >
>>> > Engineer - Data Technologies
>>> > *WSO2, Inc. : wso2.com*
>>> >
>>> > *Mobile*          : +94 716 506 693
>>> >
>>> > _______________________________________________
>>> > Architecture mailing list
>>> > [email protected]
>>> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>>
>>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>    http://people.apache.org/~hemapani/
>    http://srinathsview.blogspot.com/
>
>
>


-- 
Malith Dhanushka

Engineer - Data Technologies
*WSO2, Inc. : wso2.com*

*Mobile*          : +94 716 506 693
