Has anyone benchmarked the performance difference between these ways of using Hadoop?

1) Java vs. C++
2) Java vs. Streaming
From looking at the Hadoop architecture, since the TaskTracker forks a separate process anyway to run the user-supplied map() and reduce() functions, I don't see where the performance overhead of using Hadoop Streaming would come from. (Of course the efficiency of the chosen script will be a factor, but I think that is orthogonal.)

On the other hand, I see a lot of benefits in using Streaming, including:

1) I can pick a language that offers a different programming paradigm (e.g. a functional language, or logic programming, if it suits the problem better). In fact, I can even choose Erlang for map() and Prolog for reduce(). Mixing and matching gives me more room to optimize.
2) I can pick the language that I am familiar with, or one that I like.
3) It is easy to switch to another language in a fine-grained, incremental way if I choose to do so in the future.

Even if I am a Java programmer, I can still write a main() method that reads from standard input and writes to standard output, and I don't see that I am losing much by doing that. The benefit is that my code can easily be moved to another language in the future.

Am I missing something here? Or is the majority of Hadoop applications written with Hadoop Streaming?

Rgds,
Ricky
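
P.S. To make the stdin/stdout point concrete, here is a minimal word-count sketch of the kind of program I mean, written in Python rather than Java only for brevity. It assumes the usual Streaming convention of one tab-separated key/value record per line, and that the reducer sees its input already sorted by key. The script name wc.py and the "map"/"reduce" command-line argument are made up for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per word, one record per output line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts for each key.

    Assumes the input records are already sorted by key, so all records
    for one word arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical convention: the first argument selects the role.
    # Without a recognized role the script does nothing.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    steps = {"map": map_lines, "reduce": reduce_lines}
    if role in steps:
        for record in steps[role](sys.stdin):
            print(record)
```

Nothing in it depends on Hadoop, so it can be tried locally with ordinary pipes, with sort standing in for the shuffle (input.txt is a placeholder):

    cat input.txt | python wc.py map | sort | python wc.py reduce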