[
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294699#comment-13294699
]
Brandyn White commented on CASSANDRA-3134:
------------------------------------------
I agree. I've been working with HBase more than Cassandra recently and that is
what I did there (custom InputFormat, same streaming jar). This is certainly
the best way to go for Cassandra also.
> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
> Key: CASSANDRA-3134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
> Project: Cassandra
> Issue Type: New Feature
> Components: Hadoop
> Reporter: Brandyn White
> Priority: Minor
> Labels: hadoop, hadoop_examples_streaming
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> (text is a repost from
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python
> library and I'm interested in taking another stab at streaming support.
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for
> communication with the streaming jar. A simple way to get this to work is to
> modify the streaming code (make hadoop-cassandra-streaming.jar) so that it
> uses the same TypedBytes communication with streaming programs, but the
> actual job IO uses the Cassandra IO. The user would have the exact same
> streaming interface, but would specify the keyspace, etc. using
> environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming,
> change the IO, and add environment variable reading.
> 2. Client side only: As the streaming jar is included in the job, no server
> side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only
> need to know the path of this new streaming jar
> 6. No need for Avro
> The negatives of this are
> 1. Duplicated code: This would be a patched copy of the streaming jar,
> although the changes themselves can be stored as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop
> (cluster side). It does require TypedBytes (client side), which can be
> included in the jar.
> I can code this up but I wanted to get some feedback from the community first.
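
To make the proposal above concrete, here is a minimal Python sketch of the
two client-side pieces it mentions: encoding values in the TypedBytes wire
format (a one-byte type code followed by big-endian data) and reading the
Cassandra settings from environment variables. The type codes follow my
understanding of the Hadoop typedbytes package; the variable names
CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY and CASSANDRA_HOST, and both
function names, are hypothetical and not part of any existing jar or library.

```python
import struct


def encode_typedbytes(value):
    """Encode one Python value as TypedBytes (type code byte + payload)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return struct.pack(">BB", 2, int(value))          # 2 = boolean
    if isinstance(value, bytes):
        return struct.pack(">Bi", 0, len(value)) + value  # 0 = raw bytes
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", 3, value)           # 3 = 32-bit int
        return struct.pack(">Bq", 4, value)               # 4 = 64-bit long
    if isinstance(value, float):
        return struct.pack(">Bd", 6, value)               # 6 = double
    if isinstance(value, str):
        data = value.encode("utf-8")
        return struct.pack(">Bi", 7, len(data)) + data    # 7 = UTF-8 string
    raise TypeError("unsupported type: %r" % type(value))


def read_cassandra_settings(environ):
    """Collect job settings from environment variables.

    The variable names are hypothetical placeholders for whatever the
    patched streaming jar would actually read.
    """
    return {
        "keyspace": environ["CASSANDRA_KEYSPACE"],
        "column_family": environ["CASSANDRA_COLUMN_FAMILY"],
        "host": environ.get("CASSANDRA_HOST", "localhost"),
    }
```

A streaming program would call read_cassandra_settings(os.environ) once at
startup and write encode_typedbytes(key) + encode_typedbytes(value) to stdout
for each output pair. Vectors, lists and maps are left out to keep the sketch
short.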