Hi Dipesh,

Thank you for the ideas. Yes, we can also support archiving a general CF without filtering the records on fields such as the stream version. It should be straightforward functionality that does not require much change in the backend.

As for using Hive, you have a point there. We did consider it earlier, but decided against that approach because we thought it had a limitation: we could not address the data when the column names are not known in Cassandra. Looking into it more now, we found that it can actually be done. So we will look again into using Hive to do the processing. With that, we can also easily support archiving from/to several data sources, such as RDBMS, Cassandra, and HDFS.

As for the indexing concerns, we are going to use a custom-index-based approach. Most probably, if Hive is used, we will directly use the functionality provided by incremental processing, which already contains the indexing features for timestamps. With these features tied in, hopefully it will be a solid implementation.

Cheers,
Anjana.

On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:

> Hi Malith,
>
> The current (hive-based) solution (and it seems the proposed solution) only
> handles Column Families (CFs) created/maintained by BAM (based on the
> stream-def). Couple of improvements would really help:
> - Currently, the archiving configuration is per 'CF+stream-def-version'. Is
>   it possible to have just one Archive configuration that takes care of a
>   given CF irrespective of the stream-def-version.
> - Archiving feature to support 'any CF' exist in a given Cassandra Cluster.
>   We are currently using Cassandra (instead of RDBMS like MySql) to store
>   Analyzed Data. Of course, the configuration would need to have name of the
>   'timestamp' column for each CF, based on which the data would be filtered
>   for archiving.
>
> For Hector-based implementation, I would imagine that 'non-secondary'
> indexing on the 'timestamp column' would require to efficiently filter and
> archive the data. If you agree, how do you folks plan to handle this? If not
> required, how would the solution scale/perform-better without indexing?
>
> Also, in addition to archiving data from Cassandra (ActiveStore) to
> Cassandra (ArchiveStore), shouldn't it support archiving to
> traditional-SAN-like-storage-options, HDFS etc.
> I think, these other options could easily/naturally be supported by Hive
> itself - where the hive-result could be streamed as key-value to these type
> of archive-stores.
>
> Regards,
> Dipesh
>
>
> Malith Dhanushka wrote
> > Hi folks,
> >
> > We (BAM team, Sumedha) had a discussion about the $Subject and following
> > are the suggested improvements for the Cassandra data archival feature in
> > BAM.
> >
> > - Remove hive script based archiving and use hector API to directly issue
> >   archive queries to Cassandra (Current implementation is based on hive
> >   where it generates hive script and archiving process uses map-reduce
> >   jobs to achieve the task and it has a limitation of discarding custom
> >   key value pairs in column family)
> >
> > - Use Task component for scheduling purposes
> >
> > - Archive data to external Cassandra ring
> >
> > - Major UI improvements
> >     - List the current archiving tasks
> >     - Edit, Remove and Schedule archiving tasks
> >     - Add new archiving task
> >
> > If there is any additional requirements please raise.
> >
> > Thanks,
> > Malith
> > --
> > Malith Dhanushka
> >
> > Engineer - Data Technologies
> > *WSO2, Inc. : wso2.com*
> >
> > *Mobile* : +94 716 506 693
> >
> > _______________________________________________
> > Architecture mailing list
> > Architecture@
> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
> --
> View this message in context:
> http://wso2-oxygen-tank.10903.n7.nabble.com/BAM-Data-Archival-Feature-improvements-tp85315p85330.html
> Sent from the WSO2 Architecture mailing list archive at Nabble.com.

--
*Anjana Fernando*
Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
