At Metaweb, we did a lot of comparisons between streaming (using Python)
and native Java. In general, streaming performance was not much slower
than native Java -- most of the slowdown came from Python being a slow
language.
The main problems we found with streaming apps are that they are hard
to write, and that it is easy to make simple mistakes in streaming that
slow down performance.
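
For readers who have not seen one, here is a minimal sketch of what a
streaming word count in Python might look like. It is only an
illustration, not code from the comparisons described above. The
function names map_lines and reduce_lines are made up for this example,
and the pitfall it avoids (writing one output line per token instead of
pre-aggregating inside the mapper) is an assumed example of a
performance mistake, not one reported in this thread.

```python
from collections import Counter
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: count words locally, then emit one
    'word<TAB>count' record per distinct word instead of one per token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    for word, n in sorted(counts.items()):
        yield f"{word}\t{n}"


def reduce_lines(lines):
    """Reducer sketch: assumes the input is sorted by key, so all
    records for one word are adjacent; sums the counts for each word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"
```

In an actual streaming job, each function would be fed from sys.stdin
and its results written to sys.stdout, one record per line, from
separate mapper and reducer scripts.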
We've been experimenting with embedding JavaScript (Rhino) and Jython
for writing jobs, and have found that performance is good and the apps
are much easier to write. The tight Java integration means that
performance bottlenecks get rewritten in Java with little sacrifice to
development speed. One of these days we'll open source these frameworks.
Parand Darugar wrote:
Travis Brady wrote:
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that
should be made known somewhere on the wiki. If it's really not supposed
to be used, why is it provided at all?
A set of reasonable performance tests and results would go a long way
toward helping people decide whether to go with streaming or not.
Hopefully we can get some numbers from this thread and publish them?
Has anyone else compared streaming with native Java?
Best,
Parand