[
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096811#comment-13096811
]
Brandyn White commented on CASSANDRA-3134:
------------------------------------------
So the only requirement is that it has TypedBytes support. I personally use
CDH, but I believe it was accepted upstream in [Hadoop
0.21|http://hadoop.apache.org/common/docs/r0.21.0/changes.html], so this would
work in vanilla 0.21 and CDH 2/3.
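
For readers who have not worked with the format: as far as I know, a TypedBytes record is a one-byte type code followed by a big-endian payload, and a streaming key/value pair is simply two records back to back. The sketch below is an illustrative encoder for a few scalar types, not the code used by Hadoopy or Dumbo. The type codes are the ones documented for Hadoop's org.apache.hadoop.typedbytes package; the function name `encode` is my own.

```python
import struct

# Type codes as documented for org.apache.hadoop.typedbytes.
BYTES, BOOL, INT, LONG, DOUBLE, STRING = 0, 2, 3, 4, 6, 7


def encode(value):
    """Serialize one value as a TypedBytes record (type code + payload)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return struct.pack(">BB", BOOL, int(value))
    if isinstance(value, int):
        # Use the 4-byte integer code when the value fits, else the 8-byte long.
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", INT, value)
        return struct.pack(">Bq", LONG, value)
    if isinstance(value, float):
        return struct.pack(">Bd", DOUBLE, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # Strings carry a 4-byte length prefix followed by UTF-8 bytes.
        return struct.pack(">Bi", STRING, len(data)) + data
    if isinstance(value, (bytes, bytearray)):
        # Raw byte arrays carry a 4-byte length prefix followed by the bytes.
        return struct.pack(">Bi", BYTES, len(value)) + bytes(value)
    raise TypeError("unsupported type: %s" % type(value).__name__)


def encode_pair(key, value):
    """A streaming key/value pair is the key record followed by the value record."""
    return encode(key) + encode(value)
```

Raw byte arrays (code 0) are probably the most relevant case here, since Cassandra row keys and column values are byte arrays.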
> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
> Key: CASSANDRA-3134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
> Project: Cassandra
> Issue Type: New Feature
> Components: Hadoop
> Reporter: Brandyn White
> Priority: Minor
> Labels: hadoop, hadoop_examples_streaming
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> (text is a repost from
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python
> library and I'm interested in taking another stab at streaming support.
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for
> communication with the streaming jar. A simple way to get this to work is to
> modify the streaming code (make hadoop-cassandra-streaming.jar) so that it
> uses the same TypedBytes communication with streaming programs, but the
> actual job IO uses the Cassandra IO. The user would have the exact same
> streaming interface, but would specify the keyspace, etc. using
> environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming,
> change the IO, and add environment variable reading.
> 2. Only Client side: As the streaming jar is included in the job, no server
> side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only
> need to know the path of this new streaming jar
> 6. No need for Avro
> The negatives of this are
> 1. Duplicative code: This would be a dupe and patch of the streaming jar.
> It can itself be stored as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop
> cluster; it requires TypedBytes on the client side, which can be included
> in the jar.
> I can code this up but I wanted to get some feedback from the community first.
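
To make the environment-variable idea in the description concrete, here is a rough sketch of the lookup and validation step. The proposal does not define any variable names, so `CASSANDRA_KEYSPACE`, `CASSANDRA_COLUMN_FAMILY`, `CASSANDRA_HOST` and `CASSANDRA_RPC_PORT` are all hypothetical, as is the helper `read_cassandra_config`. The real reader would presumably live in the Java streaming jar; Python is used here only to show the logic.

```python
import os

# Hypothetical variable names; the proposal does not specify any.
REQUIRED = ("CASSANDRA_KEYSPACE", "CASSANDRA_COLUMN_FAMILY")
# Assumed defaults: a local node on the Thrift RPC port.
DEFAULTS = {"CASSANDRA_HOST": "localhost", "CASSANDRA_RPC_PORT": "9160"}


def read_cassandra_config(env=None):
    """Collect Cassandra job settings from environment variables."""
    env = os.environ if env is None else env
    # Fail early if a required setting is absent or empty.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    config = dict(DEFAULTS)
    for name in REQUIRED + tuple(DEFAULTS):
        if name in env:
            config[name] = env[name]
    # Ports arrive as strings; convert so callers get an integer.
    config["CASSANDRA_RPC_PORT"] = int(config["CASSANDRA_RPC_PORT"])
    return config
```

A job would then be launched with something like `CASSANDRA_KEYSPACE=ks CASSANDRA_COLUMN_FAMILY=cf` set in the submitting shell, and the streaming interface itself would stay unchanged.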