[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994608#comment-14994608
 ] 

Colin Patrick McCabe commented on MAPREDUCE-6538:
-------------------------------------------------

bq. \[The Java client APIs provide significant advantages that neither 
streaming nor pipes provide\]... is a false statement. Partitioning, for 
example, can't be done natively in streaming code but can in pipes. In 
streaming, you can only provide a Java class.

I agree that supporting partitioning is an advantage of pipes that streaming 
doesn't have.  There are still advantages that the Java API has over both, 
which is the point I was making.  I also don't see a fundamental reason why 
streaming couldn't be extended to provide this, which would be beneficial to 
languages like Python that can't use pipes.
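
To make the partitioning point concrete, here is a rough sketch of the kind of function a streaming job might supply if streaming were extended this way.  This is purely hypothetical: streaming has no such hook today, and the name {{partition}} and its signature are made up for illustration.  It mirrors the usual hash-modulo scheme, with CRC32 standing in for a key hash.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in [0, num_reducers).

    Hypothetical hook: Hadoop streaming does not call anything like this.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    # crc32 is stable across processes, unlike Python's salted built-in
    # hash(), so every task sends a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

The important property is that the mapping is deterministic for a given key and reducer count, so all records for one key land on the same reducer.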

bq. Correct. Because if the code is being written MR in C++, why would one use 
the less functional streaming API? If one believes that MR jobs consist of 
nothing but reading and writing KVs I could see that, but there's a lot more 
going on under the hood in more advanced jobs. That functionality is just 
flat-out not available in streaming.

I would personally rather use a JVM language, or deal with the simple and 
clean stdin/stdout paradigm of streaming, than deal with pipes.
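
For reference, this is roughly what the streaming paradigm looks like from the mapper's side.  It is a minimal word-count sketch, assuming the default streaming convention of one record per line with the key and value separated by a tab.  The names {{map_lines}} and {{main}} are illustrative, not part of any Hadoop API.

```python
import sys
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Turn input lines into tab-separated "word<TAB>1" records."""
    for line in lines:
        # split() with no arguments drops the trailing newline and
        # collapses runs of whitespace.
        for word in line.split():
            yield f"{word}\t1"


def main() -> None:
    # The whole contract with the framework: read records from stdin,
    # write key/value records to stdout.
    for record in map_lines(sys.stdin):
        sys.stdout.write(record + "\n")
```

A real mapper script would end with a call to {{main()}}.  Keeping the logic in {{map_lines}} means it can be exercised on an in-memory list without a cluster.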

There is a lot of technical debt in pipes.  It is hardcoded to output log 
messages to stderr using {{fprintf}}.  Keys and values need to be serialized to 
C++ {{std::string}} objects.  It doesn't follow the same coding style as the 
other C++ code in Hadoop.  It builds at {{\-O0}} and doesn't generate a 
{{.so}}, just a {{.a}}.  There is no unit test suite, no concept of what the 
API is or how it's allowed to change over time, and very little documentation.

[~aw], since you are committed to keeping pipes around, can you please file 
follow-on JIRAs for fixing these issues and link them to this JIRA?  I will 
close this as WONTFIX.  We can always revisit this later if things change.

> Deprecate hadoop-pipes
> ----------------------
>
>                 Key: MAPREDUCE-6538
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6538
>             Project: Hadoop Map/Reduce
>          Issue Type: Wish
>          Components: pipes
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>
> Development appears to have stopped on hadoop-pipes upstream for the last few 
> years, aside from very basic maintenance.  Hadoop streaming seems to be a 
> better alternative, since it supports more programming languages and is 
> better implemented.
> There were no responses to a message on the mailing list asking for users of 
> Hadoop pipes... and in my experience, I have never seen anyone use this.  We 
> should remove it to reduce our maintenance burden and build times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
