[ https://issues.apache.org/jira/browse/CASSANDRA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502016#comment-13502016 ]

Scott Fines commented on CASSANDRA-2388:
----------------------------------------

I have two distinct use cases where running TaskTrackers alongside Cassandra 
nodes does not accomplish our goals:

1. Joining data. We have a large data set in Cassandra, true, but we have a 
*much* larger data set held in Hadoop itself (around 4 orders of magnitude 
larger in Hadoop than in Cassandra). We need to join the two datasets together 
and use the output of that join to feed multiple systems, none of which is 
Cassandra. Since the data in Hadoop is so much larger than the data in 
Cassandra, we have to bring the Cassandra data to Hadoop, not the other way 
around. Because of security concerns, we can't spread our Hadoop data onto our 
Cassandra nodes (even if that didn't screw with our capacity planning), so we 
have no choice but to move the Cassandra data (in small chunks) onto Hadoop. 
Why not use HBase, you say? We needed Cassandra for its write performance on 
problems other than this one. 

2. Offline, incremental backups. We have a large volume of time-series data 
held in Cassandra, and taking nightly snapshots and moving them to our archival 
center is prohibitively slow: it turns out that moving RF copies of our entire 
dataset over a leased line every night is a pretty bad idea. Instead, I use 
MapReduce to take an incremental backup of a much smaller subset of the data, 
then move that. That way, not only do we avoid moving the entire data set, we 
also use Cassandra's consistency mechanisms to resolve all the replicas. The 
only efficient way I've found to do this is via MapReduce (we use the 
RandomPartitioner; a rough sketch of such a job follows below), and since it's 
an offline backup, we need to move the data over the network anyway, so we may 
as well use the optimized network connecting Hadoop and Cassandra instead of 
the tiny pipe connecting Cassandra to our archival center. 
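
For concreteness, here is a rough sketch of the shape both of these jobs take: 
a map-only Hadoop job that reads a column family through 
ColumnFamilyInputFormat/ConfigHelper and writes rows out to HDFS. The host, 
port, keyspace, column family, and output path below are placeholders rather 
than our actual configuration, and the mapper body is deliberately trivial.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CassandraExportJob {

    // Map-only pass over a column family: one output line per Cassandra row
    // (assumes UTF-8 row keys). A real join or incremental-backup job would
    // do its actual work here instead.
    public static class ExportMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException {
            context.write(new Text(ByteBufferUtil.string(key)),
                          new Text(Integer.toString(columns.size())));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-export");
        job.setJarByClass(CassandraExportJob.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0);                        // map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/backups/cassandra/incremental"));

        // Cassandra input settings; host, port, keyspace and CF are placeholders.
        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "cassandra-node-1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");

        // Read every column of every row in the split.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Every job of this shape depends on the record reader being willing to fall 
back to another replica when the first location for a split is unreachable, 
which is exactly what this ticket is about.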

Both of these use cases dictate that we *not* run a TaskTracker alongside our 
Cassandra nodes, no matter what the *recommended* approach is. In both cases, 
we need a strong, fault-tolerant ColumnFamilyInputFormat (CFIF) to serve our 
purposes.


                
> ColumnFamilyRecordReader fails for a given split because a host is down, even 
> if records could reasonably be read from other replicas.
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2388
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2388
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Eldon Stegall
>            Assignee: Mck SembWever
>            Priority: Minor
>              Labels: hadoop, inputformat
>             Fix For: 1.1.7
>
>         Attachments: 0002_On_TException_try_next_split.patch, 
> CASSANDRA-2388-addition1.patch, CASSANDRA-2388-extended.patch, 
> CASSANDRA-2388.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch, 
> CASSANDRA-2388.patch
>
>
> ColumnFamilyRecordReader only tries the first location for a given split. We 
> should try multiple locations for a given split.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
