Also bear in mind that there is a kind of detour involved: a Pipes map task 
has to send its key/value output back to the Java process, which then passes 
it on to the reduce side (more or less). 
I think the Hadoop C Extension (HCE, available as a patch) is supposed to be 
faster. 
I would be interested to know whether anyone in the community has experience 
with HCE performance.
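
For anyone who has not seen it, here is a minimal sketch of the stdin/stdout 
contract that streaming relies on (Bobby describes it below). It assumes a 
word-count style job and the default streaming convention of one record per 
line with a tab between key and value. The name run_mapper is made up for 
illustration and is not part of any Hadoop API.

```python
import sys


def run_mapper(infile, outfile):
    """Word-count style mapper (illustrative only).

    Reads one input record per line from infile and writes one
    'word<TAB>1' line to outfile for every whitespace-separated word.
    """
    for line in infile:
        for word in line.split():
            # Key and value are separated by a tab, the default
            # convention for streaming records.
            outfile.write(f"{word}\t1\n")


if __name__ == "__main__":
    # Under streaming, the Java task would feed records to this process
    # on stdin and read the emitted key/value lines back from stdout.
    run_mapper(sys.stdin, sys.stdout)
```

Every record crosses the process boundary as text in both directions, which 
is where the overhead discussed in this thread comes from.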
C

On Apr 5, 2012, at 3:49 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a 
> separate process that is running whatever you want it to run.  The JVM that 
> is running Hadoop then communicates with this process to send the data over 
> and get the processing results back.  The difference between streaming and 
> pipes is that streaming uses stdin/stdout for this communication, so 
> preexisting tools like grep, sed and awk can be used here.  Pipes uses a 
> custom protocol with a C++ library to communicate.  The C++ library is tagged 
> with SWIG-compatible data so that it can be wrapped to have APIs in other 
> languages like Python or Perl.
> 
> I am not sure what the performance difference is between the two, but in my 
> own work I have seen a significant performance penalty from using either of 
> them, because there is a somewhat large overhead of sending all of the data 
> out to a separate process just to read it back in again.
> 
> --Bobby Evans
> 
> 
> On 4/5/12 1:54 PM, "Mark question" <markq2...@gmail.com> wrote:
> 
> Hi guys,
>  quick question:
>   Are there any performance gains from Hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
> 
> Thank you,
> Mark
> 
