[jira] [Commented] (HADOOP-10389) Native RPCv9 client

Colin Patrick McCabe (JIRA) Tue, 10 Jun 2014 18:04:30 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027288#comment-14027288
 ]


Colin Patrick McCabe commented on HADOOP-10389:
-----------------------------------------------

bq. What make me concerned is that the code has to bring in a lot more 
dependency in plain C, which has a high cost on maintenance

Currently, the libraries we depend on are: {{libuv}}, for portability 
primitives, {{protobuf-c}}, for protobuf functionality, {{expat}}, for XML 
parsing, and {{liburiparser}}, for parsing URIs.  None of that functionality is 
provided by the C++ standard library, so your statement is false.

bq. For example, this patch at least contains implementation of linked list, 
splay tress, hash tables, and rb trees. There are a lot of overheads on 
implementing, reviewing and testing the code.

A lot of this code is not new.  For example, we were using {{tree.h}} (which 
implements splay trees and rb trees), previously in libhdfs.  The maintenance 
burden was not high.  In fact, it was zero, because we never had to fix a bug 
in {{tree.h}}.  So once again, your statement is just false.

{{htable.c}} got a review because it is new code.  I would hardly call 
reviewing new code a "maintenance burden."  And anyway, there is already a 
conventional C way to use hash tables: the POSIX {{hcreate}}, {{hsearch}}, and 
{{hdestroy}} functions, and their reentrant glibc variants {{hcreate_r}}, 
{{hsearch_r}}, and {{hdestroy_r}}.  We would like to use those, but Windows 
doesn't implement them.
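
For reference, here is a minimal sketch of what using the reentrant glibc calls looks like on a platform that has them.  The {{table_put}} and {{table_get}} helpers are hypothetical names made up for this illustration; they are not part of the patch or of {{htable.c}}.

```c
#define _GNU_SOURCE   /* hcreate_r, hsearch_r, hdestroy_r are glibc extensions */
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: insert key -> value.
 * Returns 0 on success, -1 if the table is full.
 * Note: if the key already exists, ENTER returns the existing entry
 * and does not replace its data.
 */
static int table_put(struct hsearch_data *tab, const char *key, void *value)
{
    ENTRY item = { .key = (char *)key, .data = value };
    ENTRY *found = NULL;

    assert(tab != NULL && key != NULL);
    return hsearch_r(item, ENTER, &found, tab) ? 0 : -1;
}

/* Hypothetical helper: look up a key.  Returns NULL if it is absent. */
static void *table_get(struct hsearch_data *tab, const char *key)
{
    ENTRY item = { .key = (char *)key, .data = NULL };
    ENTRY *found = NULL;

    assert(tab != NULL && key != NULL);
    if (!hsearch_r(item, FIND, &found, tab))
        return NULL;
    return found->data;
}
```

The caller zeroes a {{struct hsearch_data}}, calls {{hcreate_r}} with a capacity, and releases it with {{hdestroy_r}}.  The table cannot grow and does not copy or free keys, which is part of why a small portable implementation can still be worth having.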

bq. For example, do you considering supporting filenames in unicode? That way I 
think libicu might need to be brought into the picture.

First of all, the question of whether we should use libicu is independent of 
the question of whether we should use C\+\+.  libicu has a C interface, and the 
standard C\+\+ libraries and runtime don't provide any unicode functionality 
beyond what the standard C libraries provide.

Second of all, I see no reason to use libicu.  All the strings we are dealing 
with are UTF-8 supplied to and from protobuf.  This means that they are 
null-terminated and can be printed and handled with existing string functions.  
libicu might come into the picture if we wanted to start normalizing unicode 
strings or using wide character strings.  But we don't need or want to do that.
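
As a rough illustration of that point (a sketch, not code from the patch): the ordinary byte-oriented string functions work on UTF-8 unchanged, because no byte of a multi-byte sequence can be NUL or an ASCII character such as '/'.  The {{utf8_codepoints}} and {{path_basename}} helpers below are hypothetical names, and the code assumes its input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a valid UTF-8 string by skipping continuation
 * bytes, which always have the bit pattern 10xxxxxx.
 */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/*
 * Return the last path component.  A byte-wise search for '/' is safe
 * because 0x2F never appears inside a multi-byte UTF-8 sequence.
 */
static const char *path_basename(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

{{strlen}} still reports bytes rather than characters, which is what buffer sizing needs anyway; counting code points is only needed for display purposes.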

bq. It looks to me that it is much more compelling to implement the code in a 
more modern language, say, c++11, where much of the headache right now is taken 
away by a mature standard library.

C++ first came on the scene in 1983.  That is 31 years ago.  C++ may be a lot 
of things, but "modern" isn't one of them.  I was a C++ programmer for 10 
years.  I know the language about as well as anyone can.  I specifically chose 
C for this project because of a few things.

Firstly, the challenge of maintaining a consistent C++ coding style is very, 
very large.  This is true even when everyone is a professional C++ programmer 
working under the same roof.  For a project like Hadoop, where C/C++ is not 
everyone's first language, the challenge is just unsupportable.  The C++ 
learning curve is much steeper than C's.  You have to know everything you 
have to know for C, plus a lot of very tricky things that are unique to C++.

There are a lot of contentious issues in the community.  Use exceptions, or 
don't use exceptions?  Use global constructors, or don't use global 
constructors?  Use boost, or don't use boost?  Use C++0x / C++11 / C++14 or use 
some older standard?  Use runtime type information ({{dynamic_cast}}, 
{{typeid}}), or don't use runtime type information?  Operator overloading, or 
no operator overloading?

There are reasonable arguments for each of these positions.  For example, 
exceptions harm performance because of the need to maintain data to do stack 
unwinding (see 
http://preshing.com/20110807/the-cost-of-enabling-exception-handling/).  That's 
just if you don't use them... if you do use them, exceptions turn out to be a 
lot slower than return codes.  They also can make code difficult to follow.  
C++ doesn't have checked exceptions, so you can never really know what any 
function will throw.  For this reason, some fairly smart people at Google have 
decided to ban exceptions from their coding standard.  This, in turn, means 
that it's difficult for libraries to throw exceptions, since open source 
projects using the Google Coding standard (and there are a lot of them) can't 
deal with exceptions.  Of course, without exceptions, certain things in C++ are 
very hard to do.  (By the way, I'm not interested in having the argument 
for/against exceptions here, just in noting that there is huge fragmentation 
here and reasonable people on both sides.)

A similar story could be told about all the other choices.  The net effect is 
that we have to police a very large set of arbitrary style decisions that just 
wouldn't come up at all if we just used C.

C\+\+ library APIs have binary compatibility issues.  A lot of them.  Just take 
a look at 
http://techbase.kde.org/Policies/Binary_Compatibility_Issues_With_C++.  Again, 
how are we going to ensure that everyone follows these rules?  It's nearly 
impossible.  Considering the number of issues we've had maintaining API 
compatibility in Java, with Java's much simpler rules, this is just a 
deal-breaker.  Whereas with C, the rules for maintaining binary compatibility 
are very simple.
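
To make that concrete, here is a minimal sketch of one common C technique: an opaque handle.  All of the {{demo_client}} names are hypothetical and exist only for this illustration; this is not the API in the patch.  Callers only ever hold a pointer, so the library can add or reorder struct fields without breaking binaries compiled against an older version.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- What would live in the public header (hypothetical names) ---- */
struct demo_client;                                /* opaque to callers */
struct demo_client *demo_client_new(int port);
int demo_client_port(const struct demo_client *c);
void demo_client_free(struct demo_client *c);

/* ---- Private to the library; the layout is free to change ---- */
struct demo_client {
    int port;
    int retries;   /* a field added later; callers never see the layout */
};

struct demo_client *demo_client_new(int port)
{
    struct demo_client *c = calloc(1, sizeof(*c));

    if (c == NULL)
        return NULL;
    c->port = port;
    c->retries = 3;
    return c;
}

int demo_client_port(const struct demo_client *c)
{
    assert(c != NULL);
    return c->port;
}

void demo_client_free(struct demo_client *c)
{
    free(c);   /* free(NULL) is a no-op, so NULL is accepted here */
}
```

The remaining rules are roughly: don't change the signature or meaning of an exported function, and don't expose struct layouts you may want to change.  Plain functions over an opaque pointer are also the shape of API that foreign-function interfaces handle most directly.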

C is available on every platform out there, even AIX.  C\+\+11 is only 
available on a subset of those platforms.  This is another advantage of plain 
old C.

But more importantly, it's easier to bind other higher-level languages to C than 
it is to C\+\+.  For example, in Python you can use ctypes to easily wrap a C 
library (https://docs.python.org/2/library/ctypes.html).  Do you want to use 
ctypes with C\+\+?  Then you're out of luck 
(http://stackoverflow.com/questions/1615813/how-to-use-c-classes-with-ctypes).  A 
similar story could be told about golang, and most other high-level languages.  
You have to write a lot of boilerplate to wrap C\+\+, and almost none for C.

If we were writing a new daemon or something, then I might consider C\+\+, even 
C\+\+11.  Yes, C\+\+11 added some good things.  {{auto}} was a good idea 
(type inference, as in golang and other newer languages), and move constructors 
are nice.  But none 
of it addresses the problems above, and all of it just adds more complexity for 
people to master.  What we are writing is just a client, and it's not that 
thick.  Especially the YARN client, which just makes some RPCs and that's it.  
And the code is nearly done.

I'm not interested in having a language flamewar here.  C has some advantages, 
and C\+\+ has another set.  For this particular project, the former outweigh 
the latter.  I'm very familiar with C\+\+ and I don't need a lecture on its 
advantages, having been a user for a decade.

If you are interested in writing a C++ interface for libhdfs or libyarn, then 
by all means do that.  I think this interface should be in a header file only, 
to avoid the binary compatibility issues I mentioned earlier.  Since the header 
file would be compiled by each client, we would be free to change it whenever 
we liked without worrying about binary compatibility.

> Native RPCv9 client
> -------------------
>
>                 Key: HADOOP-10389
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10389
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: HADOOP-10388
>            Reporter: Binglin Chang
>            Assignee: Colin Patrick McCabe
>         Attachments: HADOOP-10388.001.patch, HADOOP-10389.002.patch, 
> HADOOP-10389.004.patch, HADOOP-10389.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
