[
https://issues.apache.org/jira/browse/CASSANDRA-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181888#comment-13181888
]
Jonathan Ellis commented on CASSANDRA-2878:
-------------------------------------------
There's one wrinkle with doing M/R over CQL -- we need to split the input space
up into token-delineated ranges, since key order may not be partitioner order.
I see a few options:
1. Add a "private" CQL Thrift method that takes token ranges as well as the
   query string
2. Add some kind of syntax to CQL to support query-by-token, e.g., "WHERE
   token(user_id) >= 2300183742897592" [here user_id is the key alias]
3. Parse the CQL query in CqlRecordReader and turn it into a Thrift
   get_range_slices call (which is similar to, but can't share much code with,
   QueryProcessor turning CQL queries into StorageProxy calls)
4. Drop the idea of adding a CqlInputFormat and just add configuration
   parameters for KeyRange to ColumnFamilyInputFormat
None of these are awesome. Option 4 is probably the most straightforward, but
it leaves us without a solution for wide rows, which a CQL inputformat can
solve as well (CASSANDRA-2474). Option 3 has the same problem of not
generalizing to 2474. Option 2 feels cleanest in some ways, but I've never been
thrilled with adding query-by-token to Thrift either, since it lends itself to
abuse (CASSANDRA-1978). Which brings us back to option 1, but then we're stuck
supporting that "hack" post-Thrift as well (CASSANDRA-2478).
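
To make the query-by-token idea more concrete, here is a rough sketch of how a
record reader might cut the token space into contiguous ranges and attach a
token predicate to each split. It is only an illustration, under several
assumptions: the token(...) predicate is the syntax proposed above rather than
anything CQL supports today, the "users" table and the helper names are
hypothetical, and the even split over RandomPartitioner's 0..2**127 token space
stands in for split points that a real job would get from the cluster.

```python
# Hypothetical sketch: per-split CQL queries using the proposed token() syntax.
# Nothing here is an existing Cassandra API; all names are illustrative.

# RandomPartitioner tokens fall in the range 0..2**127.
RANDOM_PARTITIONER_MAX = 2 ** 127


def split_token_space(min_token, max_token, num_splits):
    """Divide [min_token, max_token] into contiguous (start, end) pairs."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    span = max_token - min_token
    # Integer arithmetic keeps the bounds exact even for 127-bit tokens.
    bounds = [min_token + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(table, key_alias, start, end, include_start=False):
    """Build one query restricted to the token range (start, end]."""
    lower_op = ">=" if include_start else ">"
    return (
        "SELECT * FROM {t} WHERE token({k}) {op} {s} AND token({k}) <= {e}"
    ).format(t=table, k=key_alias, op=lower_op, s=start, e=end)


def split_queries(table, key_alias, min_token, max_token, num_splits):
    """One query per split; only the first split includes its lower bound."""
    ranges = split_token_space(min_token, max_token, num_splits)
    return [
        split_query(table, key_alias, start, end, include_start=(i == 0))
        for i, (start, end) in enumerate(ranges)
    ]


if __name__ == "__main__":
    for q in split_queries("users", "user_id", 0, RANDOM_PARTITIONER_MAX, 4):
        print(q)
```

Each mapper would then run exactly one of these queries, so every row is read
once regardless of how the row keys themselves sort.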
Thoughts?
> Allow CQL-based map/reduce
> --------------------------
>
> Key: CASSANDRA-2878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2878
> Project: Cassandra
> Issue Type: New Feature
> Components: Hadoop
> Reporter: Mck SembWever
> Assignee: Jonathan Ellis
> Priority: Minor
> Fix For: 1.1
>
>
> Currently, when running a MapReduce job against data in a Cassandra data
> store, it reads through all the data for a particular ColumnFamily. This
> could be optimized to only read through those rows that have to do with the
> query.
> Adding CQL support to m/r will allow using an index more simply than trying
> to cram support for more parameters into the job configuration.