Patch Hadoop Streaming Source to Support Cassandra IO
-----------------------------------------------------

                 Key: CASSANDRA-3134
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
             Project: Cassandra
          Issue Type: New Feature
          Components: Hadoop
            Reporter: Brandyn White
            Priority: Minor


(text is a repost from 
[CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])

I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python library 
and I'm interested in taking another stab at streaming support. Hadoopy and 
Dumbo both use the TypedBytes format that is in CDH for communication with the 
streaming jar. A simple way to get this to work is to modify the streaming code 
(making a hadoop-cassandra-streaming.jar) so that it keeps the same TypedBytes 
communication with streaming programs, but the actual job IO goes through the 
Cassandra IO. The user would have the exact same streaming interface, but would 
specify the keyspace, etc. using environment variables.
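
As a rough illustration of the environment-variable idea, here is a minimal 
Python sketch. The variable names (CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY, 
CASSANDRA_HOST, CASSANDRA_RPC_PORT), the defaults, and the function name are 
all hypothetical; nothing in this proposal defines them, and the real lookup 
would presumably live in the Java streaming code rather than in Python.

```python
import os

# Hypothetical variable names; the proposal does not define them.
REQUIRED = ("CASSANDRA_KEYSPACE", "CASSANDRA_COLUMN_FAMILY")
# Assumed defaults, for illustration only.
DEFAULTS = {"CASSANDRA_HOST": "localhost", "CASSANDRA_RPC_PORT": "9160"}


def read_cassandra_settings(environ=None):
    """Collect Cassandra IO settings from environment variables."""
    env = os.environ if environ is None else environ
    # Fail early if a setting without a sensible default is absent.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    # Environment values are strings, so convert the port explicitly.
    settings["CASSANDRA_RPC_PORT"] = int(settings["CASSANDRA_RPC_PORT"])
    return settings
```

Passing a plain dict instead of os.environ keeps the lookup easy to test 
without touching the real process environment.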

The benefits of this are
1. Easy implementation: Take the Cloudera-patched version of streaming, 
change the IO, and add environment variable reading.
2. Client side only: As the streaming jar is included in the job, no server 
side changes are required.
3. Simple maintenance: If the Hadoop Cassandra interface changes, then this 
would require the same simple fixup as any other Hadoop job.
4. The TypedBytes format supports all of the necessary Cassandra types 
(https://issues.apache.org/jira/browse/HADOOP-5450)
5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only 
need to know the path of this new streaming jar
6. No need for Avro
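
To make the TypedBytes point concrete, here is a minimal Python sketch of an 
encoder. It assumes the layout described in the Hadoop typedbytes package 
documentation: a one-byte type code followed by big-endian data (0 bytes, 
2 boolean, 3 int, 4 long, 6 double, 7 string, 8 vector, 10 map). The function 
name and the Python-to-TypedBytes type mapping are my own choices, not what 
Hadoopy or Dumbo actually do, and the codes should be checked against the 
Hadoop source before relying on them.

```python
import struct


def encode(value):
    """Serialize a Python value into TypedBytes (illustrative subset only)."""
    if isinstance(value, bool):  # must be tested before int: bool is an int
        return struct.pack(">BB", 2, int(value))
    if isinstance(value, bytes):  # raw bytes: code 0, 4-byte length, data
        return struct.pack(">Bi", 0, len(value)) + value
    if isinstance(value, int):
        if -2**31 <= value < 2**31:  # fits a 32-bit int: code 3
            return struct.pack(">Bi", 3, value)
        return struct.pack(">Bq", 4, value)  # otherwise a 64-bit long: code 4
    if isinstance(value, float):  # always written as a double: code 6
        return struct.pack(">Bd", 6, value)
    if isinstance(value, str):  # UTF-8 string: code 7, 4-byte length, data
        raw = value.encode("utf-8")
        return struct.pack(">Bi", 7, len(raw)) + raw
    if isinstance(value, (list, tuple)):  # vector: code 8, count, elements
        body = b"".join(encode(item) for item in value)
        return struct.pack(">Bi", 8, len(value)) + body
    if isinstance(value, dict):  # map: code 10, count, key/value pairs
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return struct.pack(">Bi", 10, len(value)) + body
    raise TypeError("unsupported type: %s" % type(value).__name__)
```

A Cassandra row could then travel as a byte-string key plus a map of column 
names to values, e.g. encode(b"rowkey") followed by encode({"col": b"val"}).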

The negatives of this are
1. Duplicated code: This would be a copy of the streaming jar with a patch 
applied, though it could be stored as just the patch.
2. TypedBytes dependency: I'd have to check, but this solution should work on 
a stock Hadoop (cluster side); it does require TypedBytes (client side), which 
can be included in the jar.

I can code this up, but I wanted to get some feedback from the community first.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
