Hi Dipesh,

Thank you for the ideas. Yes, we can also support archiving a general CF
without filtering the records on fields such as the stream version. This
should be straightforward functionality that does not require much change
in the backend.

As for using Hive, you have a point there. We did consider it earlier but
decided against that approach, thinking it had a limitation: we could not
address the data when the column names are not known in Cassandra. Looking
into it more now, we have found that this can actually be done. So we will
look again at using Hive to do the processing. With that, we can also easily
support archiving from/to several data sources, such as RDBMS, Cassandra,
and HDFS.

As for the indexing concerns, we are going to use a custom-index-based
approach. Most probably, if Hive is used, we will directly use the
functionality provided by incremental processing, which already contains
indexing features for timestamps. With these features tied in, it should
hopefully be a solid implementation.
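
To make the indexing idea a bit more concrete, here is a rough sketch of a
timestamp-bucketed custom index. It is only an in-memory illustration under
my own assumptions: the class name, the method names, and the one-hour
bucket size are all hypothetical, and this is not how the incremental
processing feature is actually implemented.

```python
from collections import defaultdict

BUCKET_MS = 60 * 60 * 1000  # one-hour buckets (an assumed granularity)


class TimestampIndex:
    """In-memory stand-in for one custom index row per time bucket."""

    def __init__(self):
        # bucket start (ms) -> list of (timestamp_ms, row_key)
        self.buckets = defaultdict(list)

    def add(self, row_key, timestamp_ms):
        # Round the timestamp down to the start of its bucket.
        bucket = timestamp_ms - (timestamp_ms % BUCKET_MS)
        self.buckets[bucket].append((timestamp_ms, row_key))

    def keys_older_than(self, cutoff_ms):
        """Return row keys with timestamp < cutoff_ms, oldest first."""
        hits = []
        for bucket in sorted(self.buckets):
            if bucket >= cutoff_ms:
                break  # every later bucket starts at or after the cutoff
            hits.extend(e for e in self.buckets[bucket] if e[0] < cutoff_ms)
        return [row_key for _, row_key in sorted(hits)]
```

The point is that an archiving job only needs to read the buckets that start
before the cutoff, rather than scanning every row in the CF.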

Cheers,
Anjana.

On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:

> Hi Malith,
>
> The current (hive-based) solution (and it seems the proposed solution) only
> handles Column Families (CFs) created/maintained by BAM (based on the
> stream-def). A couple of improvements would really help:
>  - Currently, the archiving configuration is per 'CF+stream-def-version'.
>    Is it possible to have just one archive configuration that takes care of
>    a given CF irrespective of the stream-def-version?
>  - The archiving feature should support 'any CF' that exists in a given
>    Cassandra cluster. We are currently using Cassandra (instead of an RDBMS
>    like MySQL) to store Analyzed Data. Of course, the configuration would
>    need to have the name of the 'timestamp' column for each CF, based on
>    which the data would be filtered for archiving.
>
> For a Hector-based implementation, I would imagine that 'non-secondary'
> indexing on the 'timestamp column' would be required to efficiently filter
> and archive the data. If you agree, how do you folks plan to handle this?
> If it is not required, how would the solution scale and perform well
> without indexing?
>
> Also, in addition to archiving data from Cassandra (ActiveStore) to
> Cassandra (ArchiveStore), shouldn't it support archiving to
> traditional SAN-like storage options, HDFS, etc.?
> I think these other options could easily/naturally be supported by Hive
> itself, where the Hive result could be streamed as key-value pairs to
> these types of archive stores.
>
> Regards,
> Dipesh
>
>
> Malith Dhanushka wrote
> > Hi folks,
> >
> > We (BAM team, Sumedha) had a discussion about the $Subject, and the
> > following are the suggested improvements for the Cassandra data archival
> > feature in BAM.
> >
> > - Remove Hive-script-based archiving and use the Hector API to directly
> > issue archive queries to Cassandra. (The current implementation is based
> > on Hive: it generates a Hive script, and the archiving process uses
> > map-reduce jobs to achieve the task. It has the limitation of discarding
> > custom key-value pairs in a column family.)
> >
> > - Use Task component for scheduling purposes
> >
> > - Archive data to external Cassandra ring
> >
> > - Major UI improvements
> >     - List the current archiving tasks
> >     - Edit, Remove and Schedule archiving tasks
> >     - Add new archiving task
> >
> > If there are any additional requirements, please raise them.
> >
> > Thanks,
> > Malith
> > --
> > Malith Dhanushka
> >
> > Engineer - Data Technologies
> > *WSO2, Inc. : wso2.com*
> >
> > *Mobile*          : +94 716 506 693
> >
> > _______________________________________________
> > Architecture mailing list
>
> > Architecture@
>
> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
>
>
>
> --
> View this message in context:
> http://wso2-oxygen-tank.10903.n7.nabble.com/BAM-Data-Archival-Feature-improvements-tp85315p85330.html
> Sent from the WSO2 Architecture mailing list archive at Nabble.com.
>



-- 
*Anjana Fernando*
Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware