On Friday 25 July 2008 15:18:24 James Moore wrote:
> On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> > Why dont you use hadoop streaming?
>
> I think that's more of a broader question - why doesn't everyone use
> streaming?
>
> There's no real difference between doing Hadoop in
> Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
> languages that run on the JVM. There's not language-specific reason
> to pick streaming over a native implementation if you're working in a
> language that has a JVM implementation. I'm working on a Ruby
> interface just because I think there's a space for a nice DSL for
> setting up Hadoop and running tasks that's more pleasant for people
> used to writing Ruby than the current idioms.

Well, there are reasons to go for streaming. It's an acceptable
interface. For many non-Java developers it's by far a nicer interface
than trying to use the Java APIs, which are clunky at best by
scripting-language standards.

For comparison, I managed to write a tar implementation that reads from
and writes to S3 in CPython/boto in less than a day. hdfstar, which is a
Jython script, took me a week. Some of that time was clearly spent
debugging the Java/Jython setup and fixing tarfile.py, but most of it
was spent dealing with the Java APIs to read and write HDFS files.
Please note that I did not know S3/boto before either, and that I can
read Java quite fine. (Actually, I can even write it quite fine, if
forced by an employer :-P )

So if streaming.jar is enough for a given use case, use it. As a side
benefit you get a program that you can also run on its own, without
Hadoop.

Andreas

>
> Streaming is great for things that don't run on a JVM - Erlang,
> Haskell, Smalltalk, etc.
>
> If you're streaming, though, you loose all the flexibility of Hadoop.
> You get line-oriented text in and out, and that's about it. But if
> you want all the Hadoop features, you're going to want to go native,
> be it in Ruby, Scala, Java, or whatever your language of choice is.
>
> Streaming is powerful, and huge numbers of solutions of the form
> "my_code < data > output" have solved many, many problems over the
> years. If your problem fits in the streaming space, then you should
> consider it. And I think that's a language-neutral statement - just
> because your solution is in Java doesn't mean you should bother
> hooking it up into a native Hadoop app.
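
To make the "line-oriented text in and out" point concrete, here is a
minimal sketch of what a streaming job can look like in Python. It is
only an illustration, not code from hdfstar or from anyone in this
thread: the word-count task, the file name wc.py, and the function
names map_lines and reduce_lines are all made up for the example. It
assumes the usual streaming convention of tab-separated key/value lines
and reducer input that arrives sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical command-line convention: 'map' or 'reduce' as first argument.
    mode = sys.argv[1] if len(sys.argv) > 1 else None
    if mode == "map":
        records = map_lines(sys.stdin)
    elif mode == "reduce":
        records = reduce_lines(sys.stdin)
    else:
        records = ()
    for record in records:
        print(record)
```

The same script runs without Hadoop as an ordinary pipeline, for
example "python wc.py map < data | sort | python wc.py reduce", which
is the side benefit mentioned above. Under Hadoop the two modes would
be passed to the streaming jar through its -mapper and -reducer
options; the exact jar location depends on the installation.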
