On Fri, Sep 6, 2013 at 7:31 AM, Srinath Perera <[email protected]> wrote:
> Do we know how long the archival job takes? e.g. how long to archive a GB
> or a million records?

With the current implementation it takes 846 seconds to archive one million
records (roughly 1,180 records per second). These numbers might change with
the latest approach.

Thanks,
Malith

> On Wed, Sep 4, 2013 at 3:46 PM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Dipesh,
>>
>> Thank you for the ideas. Yes, we can also support archiving a general CF
>> without filtering the records on fields such as the stream version. It
>> should be straightforward functionality, without much change in the
>> backend.
>>
>> As for using Hive, you have a point there. We considered it earlier too,
>> but decided against that approach, thinking it had a limitation: we could
>> not address the data when the column names are not known in Cassandra.
>> Having looked into it further, we have now identified that it can in fact
>> be done. So we will look again at using Hive to do the processing. With
>> that, we can also easily support archiving from/to several data sources,
>> such as RDBMS, Cassandra, and HDFS.
>>
>> As for the indexing concerns, we are going to use a custom index based
>> approach. If Hive is used, we will most probably use the functionality
>> provided by incremental processing, which already contains indexing
>> features for timestamps. With these features tied in, it should be a
>> solid implementation.
>>
>> Cheers,
>> Anjana.
>>
>> On Wed, Sep 4, 2013 at 11:44 AM, Dipesh Chheda wrote:
>>
>>> Hi Malith,
>>>
>>> The current (Hive-based) solution (and, it seems, the proposed solution)
>>> only handles column families (CFs) created/maintained by BAM (based on
>>> the stream definition). A couple of improvements would really help:
>>>
>>> - Currently, the archiving configuration is per 'CF + stream-def-version'.
>>>   Is it possible to have just one archive configuration that takes care
>>>   of a given CF irrespective of the stream-def-version?
>>> - The archiving feature should support any CF that exists in a given
>>>   Cassandra cluster. We are currently using Cassandra (instead of an
>>>   RDBMS like MySQL) to store analyzed data. Of course, the configuration
>>>   would need to have the name of the 'timestamp' column for each CF,
>>>   based on which the data would be filtered for archiving.
>>>
>>> For a Hector-based implementation, I would imagine that a non-secondary
>>> index on the 'timestamp' column would be required to filter and archive
>>> the data efficiently. If you agree, how do you folks plan to handle
>>> this? If it is not required, how would the solution scale/perform well
>>> without indexing?
>>>
>>> Also, in addition to archiving data from Cassandra (ActiveStore) to
>>> Cassandra (ArchiveStore), shouldn't it support archiving to
>>> traditional-SAN-like storage options, HDFS, etc.? I think these other
>>> options could easily/naturally be supported by Hive itself, where the
>>> Hive result could be streamed as key-value pairs to these types of
>>> archive stores.
>>>
>>> Regards,
>>> Dipesh
>>>
>>>
>>> Malith Dhanushka wrote:
>>> > Hi folks,
>>> >
>>> > We (the BAM team and Sumedha) had a discussion about the $Subject, and
>>> > the following are the suggested improvements for the Cassandra data
>>> > archival feature in BAM.
>>> >
>>> > - Remove Hive script based archiving and use the Hector API to
>>> >   directly issue archive queries to Cassandra. (The current
>>> >   implementation is based on Hive, where it generates a Hive script
>>> >   and the archiving process uses MapReduce jobs to achieve the task;
>>> >   it has the limitation of discarding custom key-value pairs in the
>>> >   column family.)
>>> > - Use the Task component for scheduling purposes.
>>> > - Archive data to an external Cassandra ring.
>>> > - Major UI improvements:
>>> >   - List the current archiving tasks
>>> >   - Edit, remove, and schedule archiving tasks
>>> >   - Add new archiving tasks
>>> >
>>> > If there are any additional requirements, please raise them.
>>> >
>>> > Thanks,
>>> > Malith
>>> > --
>>> > Malith Dhanushka
>>> > Engineer - Data Technologies
>>> > WSO2, Inc. : wso2.com
>>> > Mobile : +94 716 506 693
>>
>> --
>> Anjana Fernando
>> Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>
>
> --
> ============================
> Srinath Perera, Ph.D.
> http://people.apache.org/~hemapani/
> http://srinathsview.blogspot.com/

--
Malith Dhanushka
Engineer - Data Technologies
WSO2, Inc. : wso2.com
Mobile : +94 716 506 693
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
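
[Editor's note] The timestamp-based filtering discussed above (selecting rows older than a cutoff from the active store before moving them to the archive ring) can be sketched roughly as follows. This is an illustrative sketch only, not the BAM implementation: a real version would page through the column family via the Hector API, while here plain Java collections stand in for Cassandra rows, and the names (`Row`, `selectForArchive`, the timestamp field) are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ArchiveFilterSketch {

    // A row is identified by its key and carries the value of the
    // configured 'timestamp' column (hypothetical names for illustration).
    record Row(String key, long timestampMillis) {}

    // Select the rows whose timestamp falls before the archive cutoff;
    // these are the rows that would be copied to the archive ring and
    // then deleted from the active store.
    static List<Row> selectForArchive(List<Row> rows, long cutoffMillis) {
        return rows.stream()
                   .filter(r -> r.timestampMillis() < cutoffMillis)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("evt-1", 1_000L),
                new Row("evt-2", 5_000L),
                new Row("evt-3", 9_000L));
        // Archive everything older than t = 6000 ms.
        for (Row r : selectForArchive(rows, 6_000L)) {
            System.out.println(r.key());   // prints evt-1 then evt-2
        }
    }
}
```

Without an index on the timestamp column, a scan like this touches every row, which is the scalability concern Dipesh raises; a timestamp index (or Hive's incremental-processing indexes, as Anjana suggests) lets the archiver read only the affected range.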
