On Jul 25, 2008, at 3:53 PM, Joydeep Sen Sarma wrote:
Just as an aside - there is probably a general perception that streaming
is really slow (at least I had it).
The last time I did some profiling (in 0.15), the primary overheads from
streaming came from the scripting language (python is ssslllloooow). For
an insanely fast script (bin/cat), I saw significant overheads in the Java
function/data path that drowned out the streaming overheads by a huge
margin (a lot of those overheads have been fixed in recent versions -
thanks to the hadoop team).
Writing a C/C++ streaming program is a pretty good way of getting good
performance (and some performance-sensitive apps in our environment
ended up doing just that).
Agreed that not all hooks are available.
Hadoop Pipes?
Arun
-----Original Message-----
From: James Moore [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 25, 2008 6:18 AM
To: [email protected]
Subject: Re: Bean Scripting Framework?
On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
Why don't you use hadoop streaming?
I think that's more of a broader question - why doesn't everyone use
streaming?
There's no real difference between doing Hadoop in
Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
languages that run on the JVM. There's no language-specific reason
to pick streaming over a native implementation if you're working in a
language that has a JVM implementation. I'm working on a Ruby
interface just because I think there's a space for a nice DSL for
setting up Hadoop and running tasks that's more pleasant for people
used to writing Ruby than the current idioms.
Streaming is great for things that don't run on a JVM - Erlang,
Haskell, Smalltalk, etc.
If you're streaming, though, you lose all the flexibility of Hadoop.
You get line-oriented text in and out, and that's about it. But if
you want all the Hadoop features, you're going to want to go native,
be it in Ruby, Scala, Java, or whatever your language of choice is.
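
To make the "line-oriented text in and out" point concrete, here is a
minimal sketch of what a streaming-style mapper might look like in Python.
The word-count task and the name map_lines are illustrative assumptions,
not something from this thread; the sketch relies only on the default
streaming convention that the text before the first tab on each output
line is treated as the key.

```python
import sys


def map_lines(lines):
    """Yield one "word<TAB>1" record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            # Key and value are separated by a tab, one record per line.
            yield f"{word}\t1"


if __name__ == "__main__":
    # The script sees nothing but text lines on stdin and
    # writes nothing but text lines to stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Everything the script knows about the job arrives as plain text on
stdin, which is the trade-off described above.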
Streaming is powerful, and huge numbers of solutions of the form
"my_code < data > output" have solved many, many problems over the
years. If your problem fits in the streaming space, then you should
consider it. And I think that's a language-neutral statement - just
because your solution is in Java doesn't mean you should bother
hooking it up into a native Hadoop app.
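
As a rough illustration of the "my_code < data > output" shape, here is
a small Python filter that sums counts per key. It is only a sketch: the
name reduce_lines and the "key<TAB>count" input format are assumptions
made for the example, and it expects its input to be already sorted by
key (which is how streaming hands data to a reducer).

```python
import sys
from itertools import groupby


def reduce_lines(lines):
    """Sum the counts for each key in "key<TAB>count" lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent records, hence the sorted-input assumption.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as sum_counts.py, the same file
works unchanged as "python sum_counts.py < sorted_data > output" on a
single machine.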
--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com