[jira] [Commented] (CASSANDRA-3134) Patch Hadoop Streaming Source to Support Cassandra IO

Brandyn White (JIRA) Wed, 13 Jun 2012 15:51:45 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294699#comment-13294699
 ]


Brandyn White commented on CASSANDRA-3134:
------------------------------------------

I agree.  I've been working with HBase more than Cassandra recently and that is 
what I did there (custom InputFormat, same streaming jar).  This is certainly 
the best way to go for Cassandra also.
                
> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from 
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy http://bwhite.github.com/hadoopy/ Python 
> library and I'm interested in taking another stab at streaming support. 
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for 
> communication with the streaming jar. A simple way to get this to work is 
> to modify the streaming code (make hadoop-cassandra-streaming.jar) so that 
> it uses the same TypedBytes communication with streaming programs, but the 
> actual job IO uses the Cassandra IO. The user would have the exact same 
> streaming interface, but would specify the keyspace, etc. using 
> environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming, 
> change the IO, and add environment variable reading.
> 2. Only Client side: As the streaming jar is included in the job, no server 
> side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this 
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types 
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only 
> need to know the path of this new streaming jar
> 6. No need for Avro
> The negatives of this are
> 1. Duplicative code: This would be a dupe and patch of the streaming jar. 
> This can be stored itself as a patch.
> 2. TypedBytes dependency: I'd have to check, but this solution should work 
> on a stock Hadoop cluster. It requires TypedBytes on the client side, which 
> can be included in the jar.
> I can code this up but I wanted to get some feedback from the community first.
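
As a rough illustration of the description above, here is a minimal Python sketch of what a streaming program would see: keyspace settings read from environment variables, and keys/values exchanged as TypedBytes. The type codes (0 = bytes, 3 = int, 7 = string) follow the Hadoop typedbytes format; the variable names CASSANDRA_KEYSPACE and CASSANDRA_COLUMN_FAMILY and all function names are hypothetical, since the issue does not specify them. This is not the patch itself, only a sketch under those assumptions.

```python
import io
import os
import struct

# Subset of the Hadoop typedbytes type codes.
TYPE_BYTES, TYPE_INT, TYPE_STRING = 0, 3, 7


def cassandra_job_config(env=None):
    """Read job settings from the environment (variable names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "keyspace": env["CASSANDRA_KEYSPACE"],
        "column_family": env["CASSANDRA_COLUMN_FAMILY"],
    }


def write_typedbytes(out, value):
    """Write one value as <1-byte type code><big-endian payload>."""
    if isinstance(value, bytes):
        out.write(struct.pack(">bi", TYPE_BYTES, len(value)) + value)
    elif isinstance(value, int):
        out.write(struct.pack(">bi", TYPE_INT, value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        out.write(struct.pack(">bi", TYPE_STRING, len(data)) + data)
    else:
        raise TypeError("unsupported type in this sketch: %r" % type(value))


def read_typedbytes(inp):
    """Read one value written by write_typedbytes; raise EOFError at end of stream."""
    header = inp.read(1)
    if not header:
        raise EOFError("end of typedbytes stream")
    (code,) = struct.unpack(">b", header)
    if code == TYPE_INT:
        return struct.unpack(">i", inp.read(4))[0]
    if code in (TYPE_BYTES, TYPE_STRING):
        (length,) = struct.unpack(">i", inp.read(4))
        data = inp.read(length)
        return data if code == TYPE_BYTES else data.decode("utf-8")
    raise ValueError("type code %d not handled in this sketch" % code)
```

A mapper would call read_typedbytes on its standard input in key/value pairs and write_typedbytes on its standard output. Libraries such as Hadoopy and Dumbo already handle this encoding, so user code would normally not implement it by hand.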

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
