[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329333#comment-14329333 ]

Colin Patrick McCabe commented on HDFS-6994:
--------------------------------------------

bq. [~wheat9] wrote: I'm concerned about this. What are the guarantees of the 
APIs for the releases? Are the APIs / ABIs going to be compatible once we 
remove exceptions in later versions? Can the user simply do a drop-in 
replacement to upgrade? For the part of libhdfs binding the answer might be 
yes, but my general impression is no due to the complexity of SEH on Windows 
and various quirks on the implementation of the C++ exceptions.

The current plan is to expose only the existing {{libhdfs.h}} API for now.  
Since this is a C API, it obviously does not include exceptions, so I do not 
think there will be a problem there.

What we are discussing is eliminating the use of exceptions internally.  Since 
this happens at a level that is not visible to users, it can certainly be done 
later if we want to.  However, I would like to see it fixed sooner rather than 
later.  It is important to stick to a consistent coding style, and we want to 
improve robustness.

I have proposed a C\+\+ API for libhdfs and libhdfs3 at 
https://issues.apache.org/jira/browse/HDFS-7207.  I would welcome more 
discussion there.  Note that my API does not use exceptions and does not 
require C\+\+11 (although it can make use of C\+\+11 features when they are 
available).

bq. [asynchronous api discussion]

If you look at a high-performance HDFS client like Impala or HAWQ, they are 
fine with synchronous APIs.  Why?  Well, most of the time your read performance 
is limited by the bandwidth of the local disks (high performance clients always 
try to do local reads, and use short-circuit and mmap if possible).  A local 
hard disk can't handle more than maybe 100 seeks a second, and the more seeks 
you do, the lower your bandwidth will be.

There is also the CPU aspect: what are you doing with the data?  Sure you can 
have 10,000 async requests going with 1 thread, but if that thread is actually 
doing anything with the data, you can cut a few zeroes off of that.  And then 
you're back to an amount of concurrent reads that can be comfortably done 
synchronously.

Async APIs work best for cases where you are doing very, very little processing 
on each request.  So an async web server like nginx, which is written in 
highly optimized straight C (no \+\+), can squeeze a few more pages per second 
out of
reducing its thread count.  But in a DB it's tougher (and as you mentioned, it 
also makes the code much more complex).

So while we should probably consider an async client at some point, I think it 
is much lower priority than other things (like finishing the existing native 
client and merging it).

> libhdfs3 - A native C/C++ HDFS client
> -------------------------------------
>
>                 Key: HDFS-6994
>                 URL: https://issues.apache.org/jira/browse/HDFS-6994
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs-client
>            Reporter: Zhanwei Wang
>            Assignee: Zhanwei Wang
>         Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch
>
>
> Hi All
> I just got the permission to open source libhdfs3, which is a native C/C++ 
> HDFS client based on Hadoop RPC protocol and HDFS Data Transfer Protocol.
> libhdfs3 provides the libhdfs-style C interface and a C++ interface.  It 
> supports both HADOOP RPC versions 8 and 9, Namenode HA, and Kerberos 
> authentication.
> libhdfs3 is currently used by Pivotal's HAWQ.
> I'd like to integrate libhdfs3 into HDFS source code to benefit others.
> You can find libhdfs3 code from github
> https://github.com/PivotalRD/libhdfs3
> http://pivotalrd.github.io/libhdfs3/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
