[ https://issues.apache.org/jira/browse/CASSANDRA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877515#comment-13877515 ]
Jeremy Hanna commented on CASSANDRA-2527:
-----------------------------------------

Maybe the conversation should shift to using the ideas from [Netflix's|http://techblog.netflix.com/2013/12/aegisthus-is-now-part-of-netflixoss.html] [Aegisthus|https://github.com/Netflix/aegisthus] and [Knewton's|http://www.knewton.com/tech/blog/2013/11/cassandra-and-hadoop-introducing-the-kassandramrhelper/] [KassandraMRHelper|https://github.com/Knewton/KassandraMRHelper] as inspiration to put something in-tree that would be maintained for each version of Cassandra. That way it could keep up with the sstable format for each version of Cassandra and have some testing in-tree for that.

It would also enable people with Hadoop clusters external to a Cassandra cluster to snapshot a point in time from their Cassandra cluster and have Hadoop bring over the snapshot and read the sstables without affecting the Cassandra cluster (apart from the IO from bringing over the sstables). In this scenario, you wouldn't need a separate analytics datacenter if you chose not to go that route.

> Add ability to snapshot data as input to hadoop jobs
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2527
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2527
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremy Hanna
>            Assignee: Tyler Hobbs
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 2.1
>
>
> It is desirable to have immutable inputs to hadoop jobs for the duration of
> the job. That way re-execution of individual tasks does not alter the output.
> One way to accomplish this would be to snapshot the data that is used as
> input to a job.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)