Do we know how long an archival job takes? E.g., how long does it take to archive a GB or a million records?

On Wed, Sep 4, 2013 at 3:46 PM, Anjana Fernando <[email protected]> wrote:
> Hi Dipesh,
>
> Thank you for the ideas. Actually, yeah, we can also support archiving a
> general CF without filtering the records on fields like stream version and
> so on. It should be straightforward functionality, without changing much
> in the backend.
>
> As for using Hive, you have a point there. We actually considered it
> earlier, but decided against that approach, thinking it had some
> limitations, in that we can't address the data when the column names are
> not known in Cassandra. But by looking into it more now, we identified that
> it can actually be done. So yeah, we will now look again into using Hive to
> do the processing. And also with that, we can easily support archiving
> from/to several data sources, such as RDBMS, Cassandra, and HDFS.
>
> And also now, for the indexing concerns, we are going to use a custom index
> based approach. Now actually, most probably if Hive is used, we are going
> to straight away use the functionality given by the incremental
> processing, where it already contains the indexing features for timestamps.
> So with these features tied in, hopefully it would be a solid
> implementation.
>
> Cheers,
> Anjana.
>
> On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:
>
>> Hi Malith,
>>
>> The current (Hive-based) solution (and, it seems, the proposed solution)
>> only handles Column Families (CFs) created/maintained by BAM (based on
>> the stream-def). A couple of improvements would really help:
>> - Currently, the archiving configuration is per 'CF+stream-def-version'.
>>   Is it possible to have just one archive configuration that takes care
>>   of a given CF irrespective of the stream-def-version?
>> - The archiving feature should support 'any CF' that exists in a given
>>   Cassandra cluster. We are currently using Cassandra (instead of an
>>   RDBMS like MySQL) to store Analyzed Data. Of course, the configuration
>>   would need to have the name of the 'timestamp' column for each CF,
>>   based on which the data would be filtered for archiving.
>>
>> For the Hector-based implementation, I would imagine that 'non-secondary'
>> indexing on the 'timestamp column' would be required to efficiently
>> filter and archive the data. If you agree, how do you folks plan to
>> handle this? If it is not required, how would the solution
>> scale/perform better without indexing?
>>
>> Also, in addition to archiving data from Cassandra (ActiveStore) to
>> Cassandra (ArchiveStore), shouldn't it support archiving to
>> traditional SAN-like storage options, HDFS, etc.?
>> I think these other options could easily/naturally be supported by Hive
>> itself, where the Hive result could be streamed as key-value to these
>> types of archive stores.
>>
>> Regards,
>> Dipesh
>>
>>
>> Malith Dhanushka wrote
>> > Hi folks,
>> >
>> > We (BAM team, Sumedha) had a discussion about the $Subject, and the
>> > following are the suggested improvements for the Cassandra data
>> > archival feature in BAM.
>> >
>> > - Remove Hive script based archiving and use the Hector API to directly
>> >   issue archive queries to Cassandra (the current implementation is
>> >   based on Hive, where it generates a Hive script and the archiving
>> >   process uses map-reduce jobs to achieve the task, and it has the
>> >   limitation of discarding custom key-value pairs in the column family)
>> >
>> > - Use Task component for scheduling purposes
>> >
>> > - Archive data to external Cassandra ring
>> >
>> > - Major UI improvements
>> >   - List the current archiving tasks
>> >   - Edit, Remove and Schedule archiving tasks
>> >   - Add new archiving task
>> >
>> > If there are any additional requirements, please raise them.
>> >
>> > Thanks,
>> > Malith
>> > --
>> > Malith Dhanushka
>> >
>> > Engineer - Data Technologies
>> > *WSO2, Inc. : wso2.com*
>> >
>> > *Mobile* : +94 716 506 693
>> >
>> > _______________________________________________
>> > Architecture mailing list
>> > Architecture@
>> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>> --
>> View this message in context:
>> http://wso2-oxygen-tank.10903.n7.nabble.com/BAM-Data-Archival-Feature-improvements-tp85315p85330.html
>> Sent from the WSO2 Architecture mailing list archive at Nabble.com.
>
> --
> *Anjana Fernando*
> Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware

--
============================
Srinath Perera, Ph.D.
http://people.apache.org/~hemapani/
http://srinathsview.blogspot.com/
