[jira] [Commented] (CASSANDRA-2527) Add ability to snapshot data as input to hadoop jobs

Jeremy Hanna (JIRA) Tue, 21 Jan 2014 07:16:34 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877515#comment-13877515
 ]


Jeremy Hanna commented on CASSANDRA-2527:
-----------------------------------------

Maybe the conversation should shift toward using the ideas from 
[Netflix's|http://techblog.netflix.com/2013/12/aegisthus-is-now-part-of-netflixoss.html]
 [Aegisthus|https://github.com/Netflix/aegisthus] and 
[Knewton's|http://www.knewton.com/tech/blog/2013/11/cassandra-and-hadoop-introducing-the-kassandramrhelper/]
 [KassandraMRHelper|https://github.com/Knewton/KassandraMRHelper] as 
inspiration for an in-tree implementation that is maintained for each version 
of Cassandra.  That way it could keep up with changes to the sstable format 
from version to version and be covered by in-tree tests.

It would also let people whose Hadoop clusters are external to the Cassandra 
cluster snapshot a point in time from Cassandra, have Hadoop bring over the 
snapshot, and read the sstables without affecting the Cassandra cluster (apart 
from the IO of copying the sstables).  In this scenario, you wouldn't need a 
separate analytics datacenter if you chose not to go that route.
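
As a rough illustration of the snapshot-then-copy flow described above, here 
is a minimal sketch.  It assumes the usual on-disk layout of 
<data_dir>/<keyspace>/<table>/snapshots/<tag>/ and sstable data files ending 
in -Data.db; the helper names snapshot_command and snapshot_data_files are 
hypothetical and not part of Cassandra, Aegisthus, or KassandraMRHelper.  
Copying the files to the Hadoop side and parsing them is left out.

```python
from pathlib import Path


def snapshot_command(tag, keyspace):
    """Build the nodetool invocation for a named snapshot of one keyspace.

    Hypothetical helper: it only builds the argument list and does not run it.
    """
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def snapshot_data_files(data_dir, keyspace, tag):
    """List the sstable data files that belong to one named snapshot.

    Assumes the layout <data_dir>/<keyspace>/<table>/snapshots/<tag>/ and
    that data files end in -Data.db.  Live sstables outside the snapshot
    directory are deliberately ignored, so the input stays fixed for the
    duration of the job.
    """
    keyspace_root = Path(data_dir) / keyspace
    return sorted(keyspace_root.glob(f"*/snapshots/{tag}/*-Data.db"))
```

The returned paths are what a job would ship to the Hadoop cluster.  Because a 
snapshot is a set of hard links to immutable sstables, the file list does not 
change while the job runs.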

> Add ability to snapshot data as input to hadoop jobs
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2527
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2527
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremy Hanna
>            Assignee: Tyler Hobbs
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 2.1
>
>
> It is desirable to have immutable inputs to hadoop jobs for the duration of 
> the job.  That way re-execution of individual tasks does not alter the output.  
> One way to accomplish this would be to snapshot the data that is used as 
> input to a job.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
