Re: [Architecture] BAM Data Archival Feature improvements

Srinath Perera Thu, 05 Sep 2013 19:03:55 -0700

Do we know how long the archival job takes? E.g., how long does it take to
archive a GB or a million records?



On Wed, Sep 4, 2013 at 3:46 PM, Anjana Fernando wrote:

> Hi Dipesh,
>
> Thank you for the ideas. Actually, yes, we can also support archiving a
> general CF without filtering the records on fields such as the stream
> version. It should be straightforward functionality that does not require
> changing much in the backend.
>
> As for using Hive, you have a point there. We actually considered it
> earlier, but decided against that approach because we thought it had a
> limitation: we could not address the data when the column names are not
> known in Cassandra. Looking into it more now, we have identified that it
> can actually be done. So yes, we will look again into using Hive to do
> the processing. With that, we can also easily support archiving from/to
> several data sources, such as RDBMS, Cassandra, and HDFS.
>
> As for the indexing concerns, we are going to use a custom-index-based
> approach. Most probably, if Hive is used, we will directly use the
> functionality provided by incremental processing, which already contains
> the indexing features for timestamps. With these features tied in,
> hopefully it will be a solid implementation.
>
> Cheers,
> Anjana.
>
> On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:
>
>> Hi Malith,
>>
>> The current (Hive-based) solution (and, it seems, the proposed solution)
>> only handles Column Families (CFs) created/maintained by BAM (based on
>> the stream-def). A couple of improvements would really help:
>>  - Currently, the archiving configuration is per 'CF+stream-def-version'.
>>    Is it possible to have just one archive configuration that takes care
>>    of a given CF irrespective of the stream-def-version?
>>  - The archiving feature should support 'any CF' existing in a given
>>    Cassandra cluster. We are currently using Cassandra (instead of an
>>    RDBMS like MySQL) to store analyzed data. Of course, the configuration
>>    would need to have the name of the 'timestamp' column for each CF,
>>    based on which the data would be filtered for archiving.
>>
>> For the Hector-based implementation, I would imagine that 'non-secondary'
>> indexing on the 'timestamp' column would be required to efficiently
>> filter and archive the data. If you agree, how do you folks plan to
>> handle this? If it is not required, how would the solution scale and
>> perform better without indexing?
>>
>> Also, in addition to archiving data from Cassandra (ActiveStore) to
>> Cassandra (ArchiveStore), shouldn't it support archiving to traditional
>> SAN-like storage options, HDFS, etc.?
>> I think these other options could easily/naturally be supported by Hive
>> itself, where the Hive result could be streamed as key-value pairs to
>> these types of archive stores.
>>
>> Regards,
>> Dipesh
>>
>>
>> Malith Dhanushka wrote
>> > Hi folks,
>> >
>> > We (BAM team, Sumedha) had a discussion about the $Subject, and the
>> > following are the suggested improvements for the Cassandra data
>> > archival feature in BAM.
>> >
>> > - Remove Hive-script-based archiving and use the Hector API to directly
>> > issue archive queries to Cassandra. (The current implementation is
>> > based on Hive: it generates a Hive script, and the archiving process
>> > uses map-reduce jobs to achieve the task. It has the limitation of
>> > discarding custom key-value pairs in the column family.)
>> >
>> > - Use Task component for scheduling purposes
>> >
>> > - Archive data to external Cassandra ring
>> >
>> > - Major UI improvements
>> >     - List the current archiving tasks
>> >     - Edit, Remove and Schedule archiving tasks
>> >     - Add new archiving task
>> >
>> > If there are any additional requirements, please raise them.
>> >
>> > Thanks,
>> > Malith
>> > --
>> > Malith Dhanushka
>> >
>> > Engineer - Data Technologies
>> > *WSO2, Inc. : wso2.com*
>> >
>> > *Mobile*          : +94 716 506 693
>> >
>> > _______________________________________________
>> > Architecture mailing list
>> > Architecture@
>> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://wso2-oxygen-tank.10903.n7.nabble.com/BAM-Data-Archival-Feature-improvements-tp85315p85330.html
>> Sent from the WSO2 Architecture mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Anjana Fernando*
> Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware


-- 
============================
Srinath Perera, Ph.D.
   http://people.apache.org/~hemapani/
   http://srinathsview.blogspot.com/
