[
https://issues.apache.org/jira/browse/CASSANDRA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502016#comment-13502016
]
Scott Fines commented on CASSANDRA-2388:
----------------------------------------
I have two distinct use cases where running TaskTrackers alongside Cassandra
nodes does not accomplish our goals:
1. Joining data. We have a large data set in Cassandra, true, but we hold a
*much* larger data set in Hadoop itself (around four orders of magnitude more
data in Hadoop than in Cassandra). We need to join the two datasets together
and use the output of that join to feed multiple systems, none of which are
Cassandra. Since the data in Hadoop is so much larger than the data in
Cassandra, we have to bring the Cassandra data to Hadoop, not the other way
around. Because of security concerns, we can't spread our Hadoop data onto our
Cassandra nodes (even if that didn't screw with our capacity planning), so we
have no choice but to move the Cassandra data (in small chunks) onto Hadoop.
Why not use HBase, you say? We needed Cassandra for its write performance on
problems other than this one.
2. Offline, incremental backups. We have a large volume of time-series data
held in Cassandra, and taking nightly snapshots and moving them to our archival
center is prohibitively slow--it turns out that moving RF copies of our entire
dataset over a leased line every night is a pretty bad idea. Instead, I use
MapReduce to take an incremental backup of a much smaller subset of the data,
then move that. That way, we not only avoid moving the entire data set, we also
use Cassandra's consistency mechanisms to resolve all the replicas. The only
efficient way I've found to do this is via MapReduce (we use the
RandomPartitioner), and since it's an offline backup, we need to move the data
over the network anyway--we may as well use the optimized network connecting
Hadoop and Cassandra instead of the tiny pipe connecting Cassandra to our
archival center. (A rough sketch of this kind of job setup appears below.)
Both of these reasons dictate that we *not* run a TT alongside our Cassandra
nodes, no matter what the *recommended* approach is. In this case, we need a
strong, fault-tolerant CFIF to serve our purposes.
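
Both jobs above read from Cassandra through ColumnFamilyInputFormat. For
reference, here is a minimal sketch of that kind of job setup, assuming the
Cassandra 1.x Hadoop API (ConfigHelper + ColumnFamilyInputFormat); the
keyspace, column family, contact host, and output path are hypothetical and
only stand in for our real configuration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CassandraExportJob
{
    // Receives one Cassandra row per call: the row key plus the columns matched by the slice predicate.
    public static class ExportMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException
        {
            // Placeholder: emit the row key (as hex) and the number of columns read;
            // a real join/backup job would emit whatever column data it needs downstream.
            context.write(new Text(ByteBufferUtil.bytesToHex(key)), new Text(Integer.toString(columns.size())));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "cassandra-export");
        job.setJarByClass(CassandraExportJob.class);
        job.setMapperClass(ExportMapper.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Configuration conf = job.getConfiguration();
        // Hypothetical keyspace, column family, and contact point.
        ConfigHelper.setInputColumnFamily(conf, "Timeseries", "Events");
        ConfigHelper.setInputInitialAddress(conf, "cassandra-seed.example.com");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        // Read every column of each row (empty start/finish means the whole row).
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
            new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The point for this ticket is only that jobs like this pull Cassandra data over
to the Hadoop side, so the record reader has to cope with an occasional dead
Cassandra node on its own.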
> ColumnFamilyRecordReader fails for a given split because a host is down, even
> if records could reasonably be read from other replica.
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-2388
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2388
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Affects Versions: 0.6
> Reporter: Eldon Stegall
> Assignee: Mck SembWever
> Priority: Minor
> Labels: hadoop, inputformat
> Fix For: 1.1.7
>
> Attachments: 0002_On_TException_try_next_split.patch,
> CASSANDRA-2388-addition1.patch, CASSANDRA-2388-extended.patch,
> CASSANDRA-2388.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch,
> CASSANDRA-2388.patch
>
>
> ColumnFamilyRecordReader only tries the first location for a given split. We
> should try multiple locations for a given split.
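
As a rough illustration of the behaviour the description asks for (this is not
the actual ColumnFamilyRecordReader internals; the class and method names are
made up, and only the Thrift client classes are real): instead of failing when
the first location of a split is down, walk all of the split's replica
locations and connect to the first one that answers.

import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

public class ReplicaFailover
{
    /**
     * Try each replica host listed for a split in turn and return a connected
     * Thrift client for the first one that answers; throw only if all are down.
     */
    public static Cassandra.Client connectToAnyReplica(List<String> replicaHosts, int rpcPort)
        throws TTransportException
    {
        TTransportException lastFailure = null;
        for (String host : replicaHosts)
        {
            try
            {
                TTransport transport = new TFramedTransport(new TSocket(host, rpcPort));
                transport.open();
                return new Cassandra.Client(new TBinaryProtocol(transport));
            }
            catch (TTransportException e)
            {
                lastFailure = e; // this replica is unreachable; fall through to the next one
            }
        }
        throw lastFailure != null
              ? lastFailure
              : new TTransportException("no replica locations were supplied for the split");
    }
}

InputSplit.getLocations() already exposes the replica endpoints for a split, so
the reader only needs to give up when every one of them is unreachable.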