On Friday 25 July 2008 15:18:24 James Moore wrote:
> On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> > Why don't you use Hadoop streaming?
>
> I think that's more of a broader question - why doesn't everyone use
> streaming?
>
> There's no real difference between doing Hadoop in
> Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
> languages that run on the JVM.  There's no language-specific reason
> to pick streaming over a native implementation if you're working in a
> language that has a JVM implementation.  I'm working on a Ruby
> interface just because I think there's a space for a nice DSL for
> setting up Hadoop and running tasks that's more pleasant for people
> used to writing Ruby than the current idioms.

Well, there are reasons to go for streaming. It's an acceptable interface.

For many non-Java developers it's by far a nicer interface than trying to use
the Java APIs, which are clunky at best by scripting-language standards.

For comparison, I managed to write a tar implementation that reads from and
writes to S3 in CPython/boto in less than a day. hdfstar, which is a Jython
script, took me a week. Some time was clearly spent debugging the Java/Jython
setup and fixing tarfile.py, but most of it was spent dealing with the Java
APIs to read/write HDFS files.

Please note that I did not know S3/boto before either, and that I can read
Java quite fine. (Actually, I can even write it quite fine, if forced by
an employer :-P )

So if streaming.jar is enough for a given use case, use it. As a side benefit
you get a program that you can also use in other ways, without Hadoop.
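
To illustrate that side benefit, here is a minimal sketch of a word count
written as a plain stdin/stdout filter. It is not taken from hdfstar or any
real job; the script name wordcount.py, the "map"/"reduce" arguments, and the
function names are all made up for the example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one "word<TAB>1" record per word, as line-oriented text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical command line: "wordcount.py map" or "wordcount.py reduce".
# Without an argument the script does nothing, so it can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Without Hadoop, the same script should work in an ordinary shell pipeline,
something like "python wordcount.py map < data | sort | python wordcount.py
reduce". Under streaming, the two commands would be passed via the -mapper
and -reducer options, and the framework takes care of the sort in between.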

Andreas



>
> Streaming is great for things that don't run on a JVM - Erlang,
> Haskell, Smalltalk, etc.
>
> If you're streaming, though, you lose all the flexibility of Hadoop.
> You get line-oriented text in and out, and that's about it.  But if
> you want all the Hadoop features, you're going to want to go native,
> be it in Ruby, Scala, Java, or whatever your language of choice is.
>
> Streaming is powerful, and huge numbers of solutions of the form
> "my_code < data > output" have solved many, many problems over the
> years.  If your problem fits in the streaming space, then you should
> consider it.   And I think that's a language-neutral statement - just
> because your solution is in Java doesn't mean you should bother
> hooking it up into a native Hadoop app.


