Just as an aside - there is probably a general perception that streaming is really slow (at least I had that impression).
The last time I did some profiling (in 0.15), the primary overhead from streaming came from the scripting language (Python is ssslllloooow). For an insanely fast script (bin/cat), I saw significant overheads in the Java function/data path that drowned out the streaming overheads by a huge margin (a lot of those overheads have been fixed in recent versions - thanks to the Hadoop team).

Writing a C/C++ streaming program is a pretty good way of getting good performance (and some performance-sensitive apps in our environment ended up doing just that). Agreed that not all hooks are available.

-----Original Message-----
From: James Moore [mailto:[EMAIL PROTECTED]
Sent: Friday, July 25, 2008 6:18 AM
To: [email protected]
Subject: Re: Bean Scripting Framework?

On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's more of a broader question - why doesn't everyone use streaming? There's no real difference between doing Hadoop in Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many languages that run on the JVM. There's no language-specific reason to pick streaming over a native implementation if you're working in a language that has a JVM implementation.

I'm working on a Ruby interface just because I think there's a space for a nice DSL for setting up Hadoop and running tasks that's more pleasant for people used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang, Haskell, Smalltalk, etc. If you're streaming, though, you lose all the flexibility of Hadoop. You get line-oriented text in and out, and that's about it. But if you want all the Hadoop features, you're going to want to go native, be it in Ruby, Scala, Java, or whatever your language of choice is.

Streaming is powerful, and huge numbers of solutions of the form "my_code < data > output" have solved many, many problems over the years.
If your problem fits in the streaming space, then you should consider it. And I think that's a language-neutral statement - just because your solution is in Java doesn't mean you should bother hooking it up into a native Hadoop app.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
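
To make the "line-oriented text in and out" point concrete, here is a minimal sketch of a streaming-style mapper. It assumes the usual Hadoop streaming convention that each output line is a key and a value separated by a tab. The names map_line and run_mapper are made up for illustration. Python is used only for brevity; as noted in the reply above, the scripting language itself can dominate the cost, and the same shape carries over to C/C++.

```python
import io
from typing import Iterator, TextIO


def map_line(line: str) -> Iterator[str]:
    """Turn one input line into zero or more 'word<TAB>1' records."""
    for word in line.split():
        yield f"{word}\t1"


def run_mapper(src: TextIO, dst: TextIO) -> None:
    """Read text lines from src and write one record per line to dst."""
    for line in src:
        for record in map_line(line):
            dst.write(record + "\n")


if __name__ == "__main__":
    # Demo on in-memory streams so the sketch runs anywhere. A real
    # streaming script would instead call run_mapper(sys.stdin, sys.stdout).
    demo_out = io.StringIO()
    run_mapper(io.StringIO("to be or\nnot to be\n"), demo_out)
    print(demo_out.getvalue(), end="")
```

Everything the script sees or produces is plain text lines, which is the trade-off described above: it fits the "my_code < data > output" pattern and can be tried locally with an ordinary shell pipe, but it has no access to Hadoop's native hooks.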
